1 Introduction

Aquatic environments occupy 71% of Earth’s surface and provide important ecosystem services [22], are a growing source of food production [17], and help regulate climate [87]. However, monitoring these environments poses significant challenges due to their inaccessibility, scale, and dynamic nature. Stakeholders often employ underwater camera systems—stationary, remotely operated, or autonomous—to gain insights into otherwise inaccessible aquatic environments. The escalating volume of data collected in this way requires automated processing using computer vision and machine learning. There are numerous applications for computer vision in environmental monitoring [3, 15, 61, 85], underwater exploration [39, 40], and food production [49, 101, 111]. However, underwater environments present a unique set of challenges including variable turbidity, colour casts, and light conditions. These variable conditions necessitate the development of specialised underwater computer vision models.

The success of deep learning, especially convolutional neural networks (CNNs) [42], in the field of computer vision also gave rise to fast and accurate object detection algorithms such as Faster Region-based CNN (Faster R-CNN) [80] and You Only Look Once (YOLO) [79]. Object detection deals with localizing objects in an image and classifying identified objects into predefined categories (classes) [73]. Many underwater object detection models have been developed to detect different marine organisms by fine-tuning and engineering new layers for Faster R-CNN and YOLO or by introducing advanced data augmentation [30, 64, 65, 70, 83]. Despite their popularity, how well these models respond to domain shifts, that is, their ability to generalize across different water turbidity, colour casts, or light conditions, remains relatively understudied [10, 72].

Datasets for underwater object detection models are often collected as videos which are then split into individual frames (images) for training and evaluation, with the number of extracted images exceeding the number of collected videos by orders of magnitude. The evaluation is commonly performed by randomly splitting the collected images into a training set and a test set, which means that different parts of the same video sequence are used in both training and evaluation. This approach can overestimate detection accuracy because it neglects the fact that water turbidity, colour casts, and light conditions change rapidly across time, locations, and ecological habitats. We argue that to make underwater object detection more useful in applied settings, the evaluation needs to account for the domain shifts inherent to aquatic environments. That is, depending on the application, there should be no overlap between a training set and a test set in terms of video identity, date and location of recording, or ecological habitats.

Many image processing and enhancement techniques, both general [28, 35, 77, 81, 110] and specific to underwater environments [24, 31, 105], have been developed to improve the visual quality of images by adjusting contrast and colour balance, enhancing the dynamic range, removing haze, or adjusting for light attenuation in underwater imagery. It has been suggested that these methods could improve downstream tasks, including underwater object detection [31, 105]. Nonetheless, several studies [34, 59, 100, 103] reported no or only small improvements in detection accuracy. These studies tested the effects of image enhancement without considering domain shift. We hypothesise that, by reducing the complexity caused by distinct domain-associated visual features, image processing and enhancement methods could provide greater accuracy improvements when models are evaluated for their ability to generalize to domains outside of the training set. This idea is related to Kolmogorov complexity [8, 41], which defines the complexity of an object as the length of its shortest possible description. From this perspective, image processing and enhancement techniques could reduce data complexity by diminishing the domain-associated visual features. As a result, lower-complexity data may lead to improved learning and inference [37]. To test our hypothesis, we propose a framework for combating domain shift in underwater object detection with image enhancement and evaluate 14 different image processing and enhancement methods across three diverse aquatic datasets and two object detection algorithms. The contributions of this paper can be summarised as follows:

  1. We proposed a data-centric framework for evaluating domain shift in underwater object detection with a robust and comprehensive cross-validation approach.

  2. The framework identified a considerable disparity in underwater object detection performance depending on whether within-domain or out-of-domain evaluation was used. This disparity has profound implications for how underwater object detectors should be evaluated in practice.

  3. To assess the effectiveness of image enhancement for underwater domain generalization, we proposed a measure based on silhouette scores and demonstrated its correlation with out-of-domain detection accuracy.

  4. We provided empirical evidence that a fast implementation of MSRCR with a limited kernel size is one of the most effective enhancement methods for combating domain shift, consistently across datasets and detection algorithms.

2 Related work

2.1 Underwater object detection

Because of the unique challenges in aquatic environments, many object detection algorithms have been specifically tailored for detecting underwater objects. The vast majority of underwater object detection models have been derived from the Faster R-CNN [30, 64, 83] and YOLO [26, 58, 60, 62, 63, 65, 70] architectures. Some have been optimised to detect specific object categories such as different marine benthic organisms [62] or jellyfish species [26, 83], while others have been designed to address underwater challenges common across various target categories. To this end, model-centric approaches have been proposed to enhance model architecture. Muksit et al. [70] optimised feature map upsampling to improve detection of small objects such as small fish. Liu et al. [58] and Gao et al. [26] integrated new attention modules to improve detection of blurred objects, a frequent result of forward scatter in underwater image acquisition. Another method inspired by the phenomena of underwater image acquisition introduced a new loss function with a criterion incorporating a physical model of underwater illumination [63]. In contrast to the model-centric approaches, Huang et al. [30] proposed a data-centric approach that introduced data augmentation techniques based on the inverse process of underwater image restoration to simulate different levels of water turbulence, thereby improving training diversity. A common topic in underwater object detection is the use of image enhancement. Most often, image enhancement has been incorporated as an auxiliary regularisation task, in which case the model is trained to restore a corrupted image and detect objects in a joint fashion [11, 23, 60, 102]. Taking a different approach, Dai et al. [18] developed a detection architecture that extracts features from both original and enhanced images using gated fusion, allowing adaptive feature extraction from different sources. Despite these advancements, none of the aforementioned studies evaluated detection accuracy under domain shifts, which is the focus of our work.

Underwater domain generalization remains relatively understudied. Ottaviani et al. [72] performed a longitudinal study at the Western Mediterranean Expandable Seafloor Observatory and observed that the detection accuracy for fish progressively decreased over the duration of their study. Liu et al. [57] proposed DG-YOLO, a domain generalization approach, which combined domain alignment with a data augmentation technique called water quality transfer (WQT) to synthetically increase the diversity of training domains. Most recently, Chen et al. [10] proposed Domain Mixup and Contrastive Learning (DMCL) and a dataset, referred to as S-UODAC, of marine benthic organisms with seven synthetic domains. DMCL is a model-centric method, which uses domain sampling and semantic consistency to achieve state-of-the-art results in domain generalization for underwater object detection. We compare the accuracy of our data-centric approach with DMCL on the S-UODAC dataset.

2.2 Underwater image enhancement

Image processing and enhancement methods aim to improve the visual quality of images by adjusting brightness, contrast, and colour balance, reducing noise, sharpening edges, and removing artifacts such as haze. In aquatic environments, the main focus of image enhancement is to counteract the selective light attenuation, which causes loss of colour accuracy; the forward scatter, which results in blurring; and the backscatter, which causes a haze-like appearance, reducing clarity and contrast. Underwater image enhancement methods vary in approach: some rely on physical models [2, 32] to correct for light attenuation and scattering [12, 24, 38, 48, 86, 98, 105, 106], while others use deep learning architectures such as CNNs [46], generative adversarial networks (GANs) [14, 31, 45, 51], or transformers [76]. Hybrid methods have also emerged that combine both paradigms, leveraging physical insights to constrain the training of deep learning models [21, 47, 75]. Although the deep learning approaches tend to be faster than the physical ones, they have been mostly trained and evaluated using images of a relatively low resolution, typically \(256\times 256\), which limits their applicability to downstream tasks such as underwater object detection.

Several studies evaluated popular image processing and enhancement methods on underwater imagery using various quantitative measures to assess the visual quality of the enhanced images and the effect of image enhancement on downstream tasks such as object detection [9, 34, 59, 100, 103]. There are two key differences between those studies and our work. Firstly, previous studies used a single dataset or several related datasets of limited diversity. Specifically, the studies focused on detecting only five categories of benthic marine organisms (sea urchins, sea cucumbers, starfish, scallops, and seaweeds) and used relatively small test sets of 300–1,111 images. With limited diversity and relatively small test sets, there was insufficient evidence that the conclusions were transferable to other aquatic environments and organisms. Secondly, except for Chen et al. [9], previous studies did not evaluate domain generalization, which is the focus of our proposed framework. That is, the previous studies used random splitting to create the training and test sets and, therefore, did not account for domain shifts due to variable turbidity, colour casts, and light conditions. Chen et al. [9] compared two image restoration methods in the context of a synthetically created domain shift. However, their design was circular because enhancement methods were applied to restore synthetically manipulated images rather than images from real-world domains. Additionally, their domain generalization evaluation was not quantitative and consisted of selected example images in which image restoration improved detections. In contrast to the previous studies, we propose a framework to test domain generalization performance of image processing and enhancement methods in the context of underwater object detection using a robust and comprehensive cross-validation approach applied to three diverse, real-world, aquatic datasets.

Fig. 1

The proposed framework for testing domain generalization in underwater object detection. An application of the framework to identify an image enhancement method effective in combating domain shifts is shown here

2.3 Domain generalization

Domain generalization deals with the design and evaluation of machine learning models that can generalize across different but related domains, which were not encountered during training. Domain generalization thus ensures that models perform robustly in real-world scenarios susceptible to domain shifts (variable data distributions), such as those caused by changes in environmental conditions. Different approaches have been proposed for domain generalization of computer vision models [93, 109]. Most of these approaches deal with image classification (object recognition) and introduce techniques based on domain alignment, data augmentation, auxiliary learning tasks, meta-learning, and optimisation objectives. Domain alignment aims to learn domain-invariant features that remain consistent across different domains [25, 50, 66, 69, 89, 90]. Data augmentation works towards the same goal by increasing the diversity of training domains through synthetic manipulation of the inputs [52, 88, 91, 108]. The inclusion of auxiliary learning tasks is a type of regularisation that encourages the model to learn more abstract and thus generalizable representations [7, 95]. Meta-learning represents a distinct approach based on the learning-to-learn framework, which adopts an episodic training paradigm, i.e. the training domains are split into meta-train and meta-test sets at each iteration to simulate domain shift [4, 20, 33]. It has been shown that domain generalization can also be improved by introducing new optimisation objectives such as sharpness-aware gradient matching [78, 94]. Additionally, some image classification methods focus on a specific case of unsupervised domain generalization [27, 107], in which the majority of source (training) domains contain unlabelled data.

Much less attention has been paid to domain generalization for object detection (however, some of the approaches for image classification can be applied to object detection). Lin et al. [54] proposed a domain-invariant disentanglement network, a type of domain alignment approach, which works on both global- and instance-level representations for generalizable object detection. Similarly, Wu and Deng [97] employed both global- and instance-level contrastive loss and introduced cyclic-disentanglement capable of generalizing to unseen target domains after being trained on a single source domain. Lee et al. [44] combined object-aware data augmentations and object-aware contrastive loss for domain alignment.

The majority of the domain generalization methods, aside from the data augmentation techniques, are model-centric approaches, which modify the model architecture or training to improve generalization. In contrast, our framework offers a straightforward data-centric approach. The simplicity of pre-processing inputs with image enhancement can be seamlessly integrated with off-the-shelf object detectors, making it both versatile and easy to implement across various detection tasks.

3 Methodology

We designed a data-centric framework (Fig. 1) for testing domain generalization in underwater object detection with a robust and comprehensive cross-validation approach. To demonstrate the usefulness of the proposed framework, we used it to evaluate 14 image processing and enhancement methods. The framework allowed us to identify image enhancement methods that could robustly and consistently deliver improvements in terms of detection accuracy across several datasets.

Table 1 Underwater object detection datasets

3.1 Proposed framework

Our framework focuses on delivering replicable, robust, application-orientated results (Fig. 1). Therefore, the framework requires a minimum of two distinct datasets. The first dataset, referred to as the discovery dataset, is used for design, optimisation, and model selection, while the second dataset, referred to as the replication dataset, is used for an independent assessment to test if the selected method’s performance can be reproduced. More than one dataset can be used in both the discovery stage and the replication stage.

Domain shift is inherent to aquatic environments [19, 72]. Therefore, we propose to always split datasets into subsets (such as training and test sets or cross-validation folds) based on domains rather than individual images. That is, domain-based splitting guarantees that all images from a given domain are entirely contained within a single data subset.

Given that most underwater object detection datasets typically contain fewer than 10,000 images [56, 85], our framework utilises a cross-validation approach. Cross-validation splits the images of a given dataset into a chosen number of non-overlapping groups (folds). Each fold is used for testing once, while the other folds are concatenated and used for training. This comprehensive approach thus uses every single image to estimate the detection performance without compromising the training procedure. The framework further splits the union of folds dedicated to training into training and validation subsets (again using domain-based splitting). The validation set can be used for hyperparameter tuning, early stopping, and model selection.

Considering the variation in the quantity of available images across different domains, the domain-based splitting may result in non-uniformly sized cross-validation folds, which in turn may result in substantial variability of the estimated detection accuracy. Therefore, in our framework, each cross-validation procedure is repeated three times with different random seeds controlling both the cross-validation splits and the neural network parameter initialisation. Finally, the detection accuracy is evaluated by concatenating detections from all test folds, and the mean and standard deviation across the three random seeds are used to select the best method. This concludes the discovery stage of the framework. In the replication stage, three randomly seeded replicates of domain-based cross-validation are used to train and evaluate the performance of the best method on the replication datasets.
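
To make the splitting procedure concrete, the following is a minimal sketch of domain-based three-fold cross-validation with randomly seeded replicates; the domain labels shown are toy placeholders (in practice, each image's habitat or location-date label), and the validation fraction is illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# toy example: domain label (habitat or location-date) for each image
image_domains = (["mangrove"] * 40 + ["reef"] * 30 + ["seagrass"] * 50
                 + ["boulders"] * 20 + ["upper"] * 25 + ["lower"] * 35)

def domain_based_folds(domains, n_folds=3, seed=0):
    """Yield (train, val, test) index arrays; every domain falls entirely into one split."""
    domains = np.asarray(domains)
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(domains))      # shuffle domains, not images
    fold_domains = np.array_split(unique, n_folds)
    for k in range(n_folds):
        test = np.flatnonzero(np.isin(domains, fold_domains[k]))
        rest = np.flatnonzero(~np.isin(domains, fold_domains[k]))
        # carve a domain-based validation subset out of the remaining folds
        splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
        tr, va = next(splitter.split(rest, groups=domains[rest]))
        yield rest[tr], rest[va], test

for seed in (0, 1, 2):                                # three randomly seeded replicates
    for train_idx, val_idx, test_idx in domain_based_folds(image_domains, seed=seed):
        pass                                          # train, early-stop on val, evaluate on test
```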

3.2 Datasets

Three publicly available underwater object detection datasets were used: DeepFish, Moreton Bay Environmental Education Centre Low Visibility dataset (MBEEC-Low-Vis), and Jellytoring (Table 1). DeepFish was used as the discovery dataset, and MBEEC-Low-Vis and Jellytoring were used as the replication datasets (Fig. 1). Additionally, we used the Synthetic Underwater Object Detection Algorithm Contest (S-UODAC) dataset to compare with previously reported results of existing approaches for underwater domain generalization.

We used the extended version [70] of the original DeepFish dataset [84], which contains 4,505 images with only one category referred to as “fish” (a catch-all label for multiple different species). Of the four datasets, DeepFish contains the most diverse habitats—20 in total—including mangroves, boulders, coral reef, and seagrass. The habitats were used as domains in our experiments.

The full MBEEC-Low-Vis dataset, which contains 19,041 images with 19 categories of fish species, was filtered to retain the six species that had more than 5,000 annotations, reducing the number of images to 16,540. The six species were Australasian snapper, eastern striped grunter, paradise threadfin bream, smallmouth scad, smooth golden toadfish, and yellowfin bream/tarwhine. Bounding box annotations for fish which did not belong to one of the six species were removed. Given that the original dataset was extracted from videos at five images per second, the redundancy was reduced by subsampling to a single image per second. Thus, the final dataset contained 3,327 images. The images were assigned to 24 domains based on the location and date of the video recording. While this definition of a domain is imperfect, the variability across the locations and dates of these recordings was deemed sufficient after examining image clusters in the space of the data’s first two principal components.

The Jellytoring 2.0 dataset [83] contains 2,886 images with 15 jellyfish species, including Aurelia aurita, Pelagia noctiluca, Chrysaora hysoscella, and Cotylorhiza tuberculata. Most of this dataset was compiled from publicly available video recordings, and thus the only information that could be used for assigning domains to images was the video recording identity, which resulted in 188 domains.

The DeepFish, MBEEC-Low-Vis, and Jellytoring datasets were split into training, validation, and test splits using a three-fold cross-validation with domain-based splitting in three randomly seeded replicates (see Section 3.1 for details). The training, validation, and test split proportions ranged from 40%–64%, 10%–24%, and 26%–40%, respectively, and the numbers of domains in the training splits ranged from 7–10, 6–15, and 72–90 for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. Additionally, to empirically demonstrate the need for domain-based splitting in the proposed framework, we also split DeepFish, MBEEC-Low-Vis, and Jellytoring into cross-validation folds using random splitting, i.e. images were randomly split into cross-validation folds without considering the domains associated with the images. With random splitting, the training, validation, and test split proportions ranged from 46%–47%, 20%–21%, and 33%–34%, respectively, and all domains were included in the training splits for DeepFish and MBEEC-Low-Vis. For Jellytoring, the number of training domains ranged from 180–185.

To compare our approach with model-centric methods for domain generalization and underwater object detection, we used the S-UODAC dataset [10], which was compiled by splitting the Underwater Object Detection Algorithm Contest (UODAC) 2020 dataset into seven equal parts and employing neural style transfer to synthetically manipulate domain information in each part. S-UODAC comprises seven domains and 5,454 images of four different marine organisms: echinoids (sea urchins), starfish, holothurian (sea cucumber), and scallop.

3.3 Evaluation metrics

Mean average precision (mAP), calculated according to the COCO evaluation standard, was used as the detection accuracy metric. Precision (P) represents the fraction of correct detections, and recall (R) represents the fraction of objects detected:

$$\begin{aligned} P=\frac{TP}{TP+FP} \hspace{1cm} R=\frac{TP}{TP+FN} \end{aligned}$$

where TP, FP, and FN are the number of true positives (correct detections), false positives (detections where there was no object), and false negatives (missed detections), respectively. A correct detection was defined for a given intersection over union (IoU) threshold, where IoU is the ratio of the area of the intersection of the ground-truth and predicted bounding boxes to the area of their union.

The average precision (AP) for a given object category (class) k was interpolated across 101 recall values (ranging from 0 to 1 in 0.01 increments):

$$\begin{aligned} \text {AP}(k)=\frac{1}{101} \sum _{r \in \{0.0, \ldots , 1.0\}}{\max _{\tilde{r} \ge r} p(\tilde{r})} \end{aligned}$$

where \(p(\tilde{r})\) is the precision value at the recall value \(\tilde{r}\). Finally, mAP was the mean of AP values across all object categories (C) in the given dataset:

$$\begin{aligned} \text {mAP}=\frac{1}{|C|} \sum _{k \in C}{\text {AP}(k)} \end{aligned}$$

We used mAP at the IoU threshold of 0.5 (mAP\(_{50}\)) as the main evaluation metric. Additionally, we used mAP\(_{50-95}\) for implementing early stopping and model selection, which was calculated as the mean of mAPs at IoU thresholds ranging from 0.50–0.95 in 0.05 increments. When comparing image enhancement and processing methods to a baseline of using raw images with no processing, we defined \(\Delta \text {mAP}_{50} = \text {mAP}_{50}^{\text {enhanced}} - \text {mAP}_{50}^{\text {raw}}\).

3.4 Object detection algorithms and training

We used two object detection algorithms: Faster R-CNN [80] and YOLO [79]. Faster R-CNN is a two-stage model, which first identifies potential objects in an image and then classifies the identified objects into one of the predefined labels (including a ‘not-an-object’ label). In contrast, YOLO is a one-stage model, which treats detection as a regression problem of predicting bounding boxes and their corresponding labels. We used the Faster R-CNN implementation provided by the Detectron2 package [99] employing the ResNet-50 with a feature pyramid network architecture (faster_rcnn_R_50_FPN_3x) containing 42 million parameters. For YOLO, the implementation referred to as YOLOv8 from the Ultralytics package [36], specifically the large architecture (YOLOv8l) containing 43.7 million parameters, was used. For both Faster R-CNN and YOLOv8, model weights pretrained on the COCO dataset [55] were used to initialise training.

Table 2 Image processing and enhancement methods

Batch sizes were set to 4 and 5 images for Faster R-CNN and YOLOv8, respectively. For both methods, we set the detection confidence to 0, the maximum number of detections per image to 100 (more than the maximum number of objects, 66, in any of our datasets), and the non-maximum suppression (NMS) threshold to 0.7. The NMS threshold of 0.7 is the default value in YOLOv8 as well as in Detectron2’s implementation of the Faster R-CNN’s region-proposal network. We also experimented with an NMS threshold of 0.5 and saw only minimal mAP\(_{50}\) improvements of 0.7 and 0.6 percentage points for Faster R-CNN and YOLOv8, respectively, when evaluated using the discovery (DeepFish) dataset. All other hyper-parameters, unless stated otherwise, were kept unchanged for both Faster R-CNN and YOLOv8 (including input image sizes of a maximum \(1333 \times 800\) and \(640 \times 640\) pixels, respectively) to isolate the effect of the tested image enhancement methods without additional fine-tuning of the underlying object detection architectures.

Two different model selection procedures were used to avoid over-fitting to the source (training) domains. For Faster R-CNN, models were trained for a maximum of 100 epochs, calculating mAP\(_{50-95}\) on the validation set at the end of every epoch. If mAP\(_{50-95}\) did not improve for 10 consecutive epochs, the training was terminated and the model with the best mAP\(_{50-95}\) was selected. As a result of this procedure, training of Faster R-CNN models ranged from 8–20 epochs on average, depending on the dataset. For YOLOv8, the model selection strategy implemented in the Ultralytics package was used, which trains the model for the full 100 epochs, stores checkpoints after every epoch, and then selects the model with the best fitness (defined as a weighted sum of mAP\(_{50}\) and mAP\(_{50-95}\) with 0.1 and 0.9 weights, respectively). YOLOv8 models’ training ranged from 55–78 epochs on average, depending on the dataset.
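
As an illustration of the YOLOv8 setup described above, the sketch below uses the Ultralytics API with the settings stated in the text; the dataset YAML path is a placeholder for one domain-based cross-validation split, and all other hyper-parameters are left at their defaults.

```python
from ultralytics import YOLO

# COCO-pretrained large model; Ultralytics stores checkpoints after every epoch and
# selects the model with the best fitness (0.1 * mAP50 + 0.9 * mAP50-95).
model = YOLO("yolov8l.pt")
model.train(
    data="deepfish_fold0_seed0.yaml",  # placeholder: one domain-based train/val split
    epochs=100,
    imgsz=640,
    batch=5,
)

# Inference with confidence threshold 0, at most 100 detections per image,
# and an NMS IoU threshold of 0.7, as used in our evaluation.
results = model.predict(source="test_images/", conf=0.0, iou=0.7, max_det=100)
```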

A different training regime was employed to compare our approach with the previously reported results of eight model-centric domain generalization approaches on the S-UODAC dataset [10]. We followed the Chen et al. [10] procedure as closely as possible by training Faster R-CNN for 12 epochs (no model selection procedure) with a stochastic gradient descent optimiser with a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. The learning rate was decayed after the first 10 epochs by a factor of 10. We used a batch size of 4, NMS threshold of 0.5, detection threshold of 0.05, and input image size of a maximum \(1333 \times 800\) pixels. As in Chen et al. [10], we trained YOLOv3 (not YOLOv8) for 100 epochs (no model selection procedure) using the Adam optimiser with a learning rate of 0.001 and momentum of 0.9. Multi-scale training was enabled. A batch size of 8, an NMS threshold of 0.5, a detection threshold of 0.02, and an input image size of a maximum \(416 \times 416\) pixels were used.

Detection models were trained on NVIDIA A100 graphical processing units (GPUs), always using the same reproducible environment with Python 3.9.18, Pytorch 1.13.0, CUDA 11.6, Ultralytics 8.0.171, Detectron2 (commit fc9c33b1f6e5d4c37bbb46dde19af41afc1ddb2a), NumPy 1.25.2, Pycocotools 2.0.7, Scikit-learn 1.1.3, Scikit-image 0.19.3, OpenCV 4.8.0.76, and Pillow 9.4.0.

3.5 Image processing and enhancement methods

We selected 14 image processing and image enhancement methods (Table 2 and Fig. 13 in Appendix A) and tested each method individually by comparing object detection performance with and without the processing or enhancement applied. The image processing and enhancement methods were selected based on their implementation availability [5, 13, 92] or having been highlighted in studies on underwater object detection [1, 18, 23, 26, 59, 100, 104]; however, those studies did not evaluate their suitability for improving domain generalization, which is the focus of our work.

The 14 evaluated methods comprised general-purpose and underwater-specific approaches of different levels of complexity. Four methods performed global intensity adjustment (gamma up/down, adjust log, adjust sigmoid, and auto-contrast), focusing on pixel-wise transformations without considering local features or spatial information. Conversely, contrast limited adaptive histogram equalisation (CLAHE) [77] is a local contrast enhancement method, which aims to prevent amplification of noise. Unsharp mask sharpens images by increasing contrast around edges. Grey world and automatic colour equalization (ACE) [81] focus on adjusting white balance and colour distribution. Grey world assumes that the average colour in an image should be neutral grey, and balances colours accordingly. ACE combines global and local colour correction based on both grey-world and white-patch assumptions, allowing more flexibility in colour balancing than grey world by incorporating local effects. Dark channel prior (DCP) [28] and colour attenuation prior (CAP) [110] were developed originally for haze removal in terrestrial images, but they were highlighted in a previous study for their positive effects on improving underwater image quality [59]. Multi-scale retinex with colour restoration (MSRCR) [35] balances dynamic range compression by processing the image at multiple scales, followed by colour restoration to avoid greying. Finally, we included three underwater-specific image enhancement methods: automatic red channel restoration (ARCR) [24], minimal colour loss and locally adaptive contrast enhancement (MLLE) [105], and fully unsupervised image enhancement generative adversarial network (FUnIE-GAN) [31]. ARCR corrects the red channel, compensating for the attenuation of red light in water. MLLE adapts to local regions while minimizing colour loss, considering specific features of underwater images. FUnIE-GAN is a deep learning-based approach that uses a generative adversarial network (GAN) to restore underwater images. We used the model trained in a paired fashion, where the generative model learns to map underwater images (\(256 \times 256\) pixels) of low perceptual quality to their high-quality counterparts by minimizing the difference between the generated and ground-truth images.
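
Several of the simpler general-purpose methods listed above are available off the shelf; the sketch below, using Pillow and scikit-image with illustrative parameter values and application details (the exact settings used in our experiments may differ), shows how they can be applied to a single frame. The underwater-specific methods require dedicated implementations.

```python
import numpy as np
from PIL import Image, ImageOps
from skimage import exposure, filters

img = np.asarray(Image.open("frame.jpg"))                     # placeholder input, uint8 RGB

auto_contrast = ImageOps.autocontrast(Image.fromarray(img))   # per-channel linear rescaling
gamma_dark    = exposure.adjust_gamma(img, gamma=2.0)         # gamma > 1 darkens mid-tones
gamma_bright  = exposure.adjust_gamma(img, gamma=0.5)         # gamma < 1 brightens mid-tones
log_adjusted  = exposure.adjust_log(img)                      # logarithmic intensity mapping
sigmoid       = exposure.adjust_sigmoid(img)                  # sigmoid contrast adjustment
unsharp       = filters.unsharp_mask(img, radius=5, amount=1, channel_axis=-1)

# CLAHE applied per channel; equalize_adapthist returns floats in [0, 1]
clahe = np.stack([exposure.equalize_adapthist(img[..., c], clip_limit=0.01)
                  for c in range(3)], axis=-1)

# Grey world: scale each channel so that its mean matches the overall mean intensity
means = img.reshape(-1, 3).mean(axis=0)
grey_world = np.clip(img * (means.mean() / means), 0, 255).astype(np.uint8)
```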

3.5.1 MSRCR implementation

MSRCR addresses dynamic range compression and colour rendition based on the retinex theory of how human colour vision retains colour consistency across varying illumination conditions [43]. It operates by decomposing an image at multiple scales, typically three, to capture fine, medium, and coarse details. These decompositions are then combined to produce an output that enhances local contrast and tonal rendition while preserving the overall structure of the image. To ensure that the enhanced images maintain natural and vivid colours, MSRCR involves a colour restoration step that prevents a greying effect often encountered with retinex algorithms.

The multi-scale decomposition includes three passes of the Gaussian blur filter at different scales determined by the standard deviation of the Gaussian distribution (\(\sigma \)) and the kernel size, which defines the size of the convolution filter (in pixels) used to perform the blurring operation. MSRCR is configured by choosing the \(\sigma \) values, while the respective kernel sizes (k) are typically determined using a rule-of-thumb formula, such as OpenCV’s createGaussianKernels function, which sets \(k = 8 \sigma + 1\).

We first implemented MSRCR using OpenCV’s implementation of the Gaussian blur filter, but this approach was prohibitively slow, averaging 3,900 milliseconds (ms) per image (\(1333 \times 750\) pixels, Apple M2 chip). To address this inefficiency, we implemented MSRCR using NVIDIA’s DALI library [71], which offers a GPU-accelerated implementation of the Gaussian filter. However, DALI’s implementation limits the kernel size to a maximum of 169 pixels. While this is sufficient to capture fine (local) details, it is too small for the medium and coarse (global) details needed for implementing MSRCR. To circumvent the kernel size limitation, we adopted the following approach: for any kernel size \(k > 169\), the image was resized by a factor of \(\frac{169}{k}\), a Gaussian filter with \(\sigma = 21\) and \(k = 169\) was applied to this low-resolution image, and the result was resized back to the original resolution. Formally, for an intensity value I(x, y) at pixel position (x, y), MSRCR is the product of the retinex operation R and the colour correction C:

$$\begin{aligned} \text {MSRCR}(x, y) = g \cdot \left( R(x, y) \cdot C(x, y) + b\right) \end{aligned}$$
$$\begin{aligned} C(x, y) = \beta \cdot \left( \log \left( \alpha \cdot I(x, y) \right) -\log \left( \textstyle \sum I(x, y) \right) \right) \end{aligned}$$
$$\begin{aligned} R(x, y) = \sum _{i=1}^{n} w_i \cdot R_{\sigma _i}(x, y) \end{aligned}$$
$$\begin{aligned} R_{\sigma _i} \left( x, y \right) = \log \left( I \left( x, y \right) \right) - \log \left( I_{\text {blurred}}^{\sigma _i} \left( x, y \right) \right) \end{aligned}$$
$$\begin{aligned} I_{\text {blurred}}^{\sigma _i} \left( x, y \right) ={\left\{ \begin{array}{ll} F_{\sigma _i} \left( x, y \right) * I \left( x, y \right) & \text {if } k_i \le 169 \\ \text {resize} \left( F_{\sigma _i=21} \left( \hat{x}, \hat{y} \right) * I_{\text {scaled}} \left( \hat{x}, \hat{y} \right) \right) & \text {if } k_i > 169 \end{array}\right. } \end{aligned}$$

where \(g=5\), \(b=25\), \(\alpha =125\), \(\beta =46\), \(n=3\), \(w_1=w_2=w_3=\frac{1}{3}\), \(\sigma _1 = 15\), \(\sigma _2 = 80\), \(\sigma _3 = 250\), \(k_i = 8 \sigma _i + 1\), \(F_{\sigma _i}\) is the Gaussian filter, \(*\) denotes convolution, and \(I_{\text {scaled}} \left( \hat{x}, \hat{y} \right) \) is the pixel intensity at the corresponding pixel position \((\hat{x}, \hat{y})\) after downscaling the image by a factor of \(\frac{k_i}{169}\). This approach worked well both qualitatively and quantitatively due to the strong blurring effect of the Gaussian filter for the \(\sigma \) values commonly used for MSRCR.
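
A minimal CPU sketch of this MSRCR variant is given below, following the equations above and using OpenCV’s Gaussian blur in place of the DALI GPU filter; parameter values match those listed, and the output normalisation (clipping to the 8-bit range) is a simplification.

```python
import cv2
import numpy as np

MAX_K = 169  # kernel size limit that motivated the resize workaround

def blur(img, sigma):
    """Gaussian blur; for kernel sizes above MAX_K, blur a downscaled copy and resize back."""
    k = int(8 * sigma + 1)
    if k <= MAX_K:
        return cv2.GaussianBlur(img, (k, k), sigma)
    h, w = img.shape[:2]
    scale = MAX_K / k
    small = cv2.resize(img, (max(1, round(w * scale)), max(1, round(h * scale))))
    small = cv2.GaussianBlur(small, (MAX_K, MAX_K), 21)   # sigma = (169 - 1) / 8 = 21
    return cv2.resize(small, (w, h))

def msrcr(img, sigmas=(15, 80, 250), g=5, b=25, alpha=125, beta=46):
    img = img.astype(np.float64) + 1.0                    # avoid log(0)
    retinex = np.zeros_like(img)
    for sigma in sigmas:                                  # multi-scale retinex with equal weights
        retinex += (np.log(img) - np.log(blur(img, sigma) + 1.0)) / len(sigmas)
    colour = beta * (np.log(alpha * img) - np.log(img.sum(axis=2, keepdims=True)))
    out = g * (retinex * colour + b)
    return np.clip(out, 0, 255).astype(np.uint8)          # simplified output normalisation
```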

We adopted the default MSRCR configuration with \(\sigma \) values of 15, 80, and 250 [35] and compared it with other configurations of MSRCR by adjusting the default \(\sigma \) values by a factor of \(\frac{1}{3}\), \(\frac{1}{2}\), and 2. Furthermore, we performed an ablation study of MSRCR by reducing the number of scales used in the default configuration to a single scale with a \(\sigma \) value of 15, 80, or 250, and to two scales with \(\sigma \) values of 15 and 80, 15 and 250, or 80 and 250. For the single-scale experiments, no colour correction was performed as we deemed it unnecessary after experimenting with a few images.

3.5.2 Estimation of image processing times

To estimate the delay caused by employing an image pre-processing step prior to inference with an object detection model, we measured the mean processing time of each individual image enhancement method over 100 DeepFish images (the dataset with the highest resolution in our study) at a resolution of \(1333 \times 750\), which is the default input size of the Faster R-CNN model for this dataset. All methods, except for the GPU-accelerated implementations of MSRCR and FUnIE-GAN, were tested on an Apple M2 chip. GPU-accelerated implementations of MSRCR and FUnIE-GAN were tested on an NVIDIA A100.
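
These per-image timings can be obtained with a simple wall-clock loop (a sketch; `enhance` stands for any of the methods in Table 2 and `images` for the 100 sampled DeepFish frames):

```python
import time
import numpy as np

def mean_processing_time_ms(enhance, images):
    """Mean wall-clock processing time of `enhance` in milliseconds over a list of images."""
    times = []
    for img in images:
        start = time.perf_counter()
        enhance(img)
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(times))
```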

3.6 Image quality measures

The underwater image quality measure (UIQM) [74] was used to evaluate the effectiveness of image enhancement methods in improving perceived image quality. UIQM is a weighted sum of three components: the underwater image contrast measure (UIConM), the underwater image colourfulness measure (UICM), and the underwater image sharpness measure (UISM), each reflecting different aspects of visual quality in underwater environments. UIConM, UICM, and UISM were designed to correlate with human perception of visual quality. Additionally, we used three general, not underwater-specific, measures of image quality: the naturalness image quality evaluator (NIQE) [67], the structural similarity index measure (SSIM) [96], and the peak signal-to-noise ratio (PSNR). PSNR quantifies the error between the original and the enhanced images and is given by

$$\begin{aligned} \text {PSNR} = 10 \log _{10} \left( \frac{\text {M}^2}{\text {MSE}} \right) \end{aligned}$$

where M is the maximum possible pixel value of the image (in our case 255), and MSE is the pixel-wise mean squared error between the original and the enhanced images.
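
The general-purpose measures are available in scikit-image; a short sketch with placeholder file paths follows (PSNR uses the definition above with M = 255 for 8-bit images):

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

raw = np.asarray(Image.open("raw.jpg"))            # placeholder paths; same-size uint8 RGB images
enhanced = np.asarray(Image.open("enhanced.jpg"))

psnr = peak_signal_noise_ratio(raw, enhanced, data_range=255)
ssim = structural_similarity(raw, enhanced, channel_axis=-1, data_range=255)
```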

3.7 Silhouette analysis

We developed a measure based on silhouette analysis [82], a clustering quality assessment technique, to quantify the effect of the image processing and enhancement methods on domain-associated visual features. First, we resized all images to \(256 \times 256\) resolution (65,536 pixels) and then further reduced the dimensionality using principal component analysis (PCA) into two dimensions (we repeated the analysis with five and ten dimensions and observed similar results). Next, we quantified how “intermixed” the images (represented with the principal components) from distinct domains were using silhouette scores, a statistic ranging from \(-1\) to 1. A score of 1 represents perfectly separated clusters, a score close to 0 represents overlapping clusters, and a score of \(-1\) represents samples occurring in unrelated clusters. We defined the negative mean silhouette score (NMSS) as the negative of the mean silhouette score over the entire dataset:

$$\begin{aligned} {\text {NMSS}}=-\frac{1}{N} \sum ^{N}_{i=1}{\bigg [\frac{b(i) - a(i)}{\max {\{a(i), b(i)\}}} \text { if } |D(i)|>1 \text { else } 0\bigg ]} \end{aligned}$$

where N is the number of data points in the dataset, D(i) is the domain to which the data point i belongs, a(i) is the mean distance between i and all other data points of D(i), and b(i) is the mean distance between i and all data points of a neighbouring domain (i.e. domain with the smallest mean distance to i). Thus, by treating domain information as cluster membership, an increased NMSS (relative to raw images without any processing) indicates that the image enhancement method diminished some of the domain-specific visual features.
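
A minimal sketch of the NMSS computation with scikit-learn is shown below, assuming `images` is an array of frames already resized to \(256 \times 256\) and `domains` holds each image's domain label; scikit-learn's silhouette_samples assigns a score of 0 to samples from singleton domains, matching the \(|D(i)| > 1\) condition in the formula above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples

def nmss(images, domains, n_components=2):
    """Negative mean silhouette score of domain labels in a PCA-reduced pixel space."""
    X = np.asarray(images, dtype=np.float64).reshape(len(images), -1)  # flatten pixel values
    X = PCA(n_components=n_components).fit_transform(X)                # reduce to two dimensions
    scores = silhouette_samples(X, np.asarray(domains))                # per-image silhouette scores
    return -float(scores.mean())                                       # higher = more intermixed domains
```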

4 Results and discussion

We first tested how well off-the-shelf object detectors work in an out-of-domain underwater object detection setting (Section 4.1). Next, we used the proposed framework to evaluate 14 image processing and enhancement methods in mitigating the drop in the detection accuracy caused by the domain shift (Section 4.2) and analysed which aspects of image enhancement contribute to this improvement (Section 4.3). Finally, the best performing image enhancement method was subjected to parameter sensitivity and computational scaling analyses (Section 4.4) and compared with model-centric domain generalization approaches using an independent test set (Section 4.5).

Fig. 2

Precision-recall curves comparing detection accuracy when models were evaluated under random and domain-based splitting cross-validation

4.1 Comparison of off-the-shelf object detectors using random and domain-based splitting evaluation

We found striking differences in detection accuracy depending on whether random or domain-based splitting was used (Fig. 2). Under the random splitting evaluation, both Faster R-CNN and YOLOv8 detection algorithms yielded high mAP\(_{50}\) values of 96.3%–96.4%, 86.2%–87.3%, and 90.3%–91.4% for DeepFish, MBEEC-Low-Vis, and Jellytoring datasets, respectively. In contrast, the domain-based splitting evaluation yielded mAP\(_{50}\) values below 53%. Faster R-CNN and YOLOv8 performed comparably to each other with mAP\(_{50}\) values of 41.6%–46.8%, 46.9%–52.6%, and 40.3%–43.1% for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively.

Under the domain-based splitting evaluation, AP\(_{50}\) values across the different object categories (classes) ranged from 26.9%–78.1% and 0.0%–89.7% for MBEEC-Low-Vis and Jellytoring (the two datasets with multiple object categories), respectively. The detection accuracies of Faster R-CNN and YOLOv8 were consistent across the different fish and jellyfish species with Pearson correlation coefficients (r) of 0.66 and 0.99 for MBEEC-Low-Vis and Jellytoring, respectively.

The variability in the number of training images and domains in the training folds of the cross-validation procedure was relatively small, with the number of images ranging from 1,815–2,441, 1,360–2,118, and 1,166–1,452 and the number of domains ranging from 7–10, 6–15, and 72–90 for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. This level of variability did not result in a detectable association with the detection accuracy (Fig. 14 in Appendix A). Thus, we subjected the DeepFish dataset to subsampling experiments by reducing the number of training images to 400. When the number of training domains was not reduced during subsampling, the difference in mAP\(_{50}\) was limited to 1.0 and 1.4 percentage points (p.p.) for Faster R-CNN and YOLOv8, respectively. However, when the number of training domains was limited to 4 and 2 (down from 9) on average, mAP\(_{50}\) decreased by 5.4 (8.8) and 10.6 (15.2) p.p. for Faster R-CNN (YOLOv8), respectively (Fig. 3). This highlights the benefit of including many diverse domains in the training set.

Fig. 3

Mean average precision (mAP\(_{50}\)) after subsampling the training images and domains of the DeepFish dataset. The individual data points correspond to three random subsampling replicates of the three randomly seeded data splits

Despite defining domains simply as the location-date or habitat of the video recordings, the influence of domain-associated visual features was observable by projecting the images into the space of the first two principal components (Fig. 4a–c). Furthermore, both detection algorithms achieved high accuracy using the random splitting evaluation, in which visual features of all domains had been observed, and could have been learned, during training. This is in stark contrast to the domain-based splitting evaluation, in which the accuracy plummeted because, at test time, the object detectors encountered visual features that had not been observed during training, an effect that was even more pronounced when the number of training domains was subsampled. Therefore, the disparity in detection accuracy between within-domain and out-of-domain evaluation can be attributed to the inability of off-the-shelf detectors to respond to domain shift. This highlights that the assumption of image independence between training and test data under the random splitting evaluation is unreasonable in the case of underwater object detection due to the strong visual effects of environmental conditions in aquatic environments.

Fig. 4

First two principal components of raw images with no processing (a–c) and after processing with MSRCR (d–f). For Jellytoring only domains with at least 40 images are shown

4.2 Evaluation of image processing and enhancement methods for underwater domain generalization

Five image enhancement methods (MSRCR, MLLE, grey world, ARCR, CLAHE) improved mAP\(_{50}\) of Faster R-CNN on the discovery dataset (DeepFish) by more than 3 p.p. compared to the baseline performance (46.8%) when raw images with no processing were used (Table 3). The best performing method—MSRCR (mAP\(_{50}\) 52.3%)—yielded an improvement of 5.5 p.p. (Figs. 5 and 15 in Appendix A).

When evaluated on the replication datasets, MSRCR delivered consistently better mAP\(_{50}\) values (57.7% and 46.3%) than raw images with no processing (52.6% and 43.1%) for both MBEEC-Low-Vis and Jellytoring, respectively (Table 3). These results constituted respective improvements of 5.1 and 3.2 p.p. Except for CLAHE, the other methods that improved mAP\(_{50}\) on DeepFish also performed well on the replication datasets, with MLLE and auto-contrast yielding the second highest \(\Delta \text {mAP}_{50}\) values of 5.0 and 3.1 p.p. on MBEEC-Low-Vis and Jellytoring, respectively. These results demonstrate that the proposed framework identified multiple methods that can consistently aid domain generalization in underwater object detection.

Table 3 Underwater domain generalization evaluation using Faster R-CNN
Fig. 5

Precision-recall curves comparing detection accuracy with and without MSRCR pre-processing evaluated with the domain-based splitting cross-validation

Table 4 Underwater domain generalization evaluation using YOLOv8

To further validate the reproducibility of these results, we tested the ability of the image enhancement methods to aid underwater domain generalization also with YOLOv8. On the discovery dataset, the top methods identified with Faster R-CNN all delivered mAP\(_{50}\) improvements (compared to raw images with no processing) of more than 6 p.p. when coupled with YOLOv8 (Table 4). This result was reproducible also across the two replication datasets with improvements of more than 1.5 and 2.0 p.p. for MBEEC-Low-Vis and Jellytoring, respectively. MSRCR was again the best performing image enhancement method on the discovery as well as the two replication datasets. MSRCR yielded mAP\(_{50}\) values of 55.1%, 55.7%, and 45.6% for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively, which constituted improvements of 13.4, 8.8, and 5.3 p.p. compared to raw images with no processing (Fig. 5 and Fig. 16 in Appendix A).

Fig. 6

Heatmaps depicting consistency of image processing and enhancement methods with respect to \(\Delta \text {mAP}_{50}\) (the difference in mAP\(_{50}\) relative to using raw images with no processing). Methods are ordered by the mean \(\Delta \text {mAP}_{50}\) across all datasets and both detection algorithms

Fig. 7

Example detections using images from DeepFish (a–c), MBEEC-Low-Vis (d–f), and Jellytoring (g–i) datasets. Ground truth (a, d, g) with the different fish (a, d) and jellyfish (g) species highlighted with bounding boxes of different colours. Detections with a model trained with no image processing (b, e, h). Detections with a model trained with MSRCR pre-processing (c, f, i). Correct detections match location and colour of the ground truth bounding boxes

Most methods performed consistently across the three datasets and the two detection algorithms (Figs. 6 and 17 in Appendix A). Two notable exceptions to this consistency were DCP and CAP, which delivered improvements of 12.0 and 7.2 p.p. on DeepFish with YOLOv8 but had minimal or negative influence when coupled with Faster R-CNN, and when applied to any other dataset, regardless of the detection algorithm. This highlights the need for using a combination of several underwater datasets and learning algorithms when evaluating image enhancement methods for downstream applications such as object detection.

Out of the global intensity adjustment methods, only auto-contrast improved detection accuracy consistently across the three datasets and the two detection algorithms. Auto-contrast performs linear rescaling of pixel intensities for each channel independently, which apart from improving contrast resulted in a reasonable colour correction for many of the underwater images (Fig. 13 in Appendix A). Judging from a visual assessment of selected images, the low (often negative) \(\Delta \text {mAP}_{50}\) performance of the other global intensity adjustment methods was likely due to little improvement in contrast (gamma up, gamma down, and adjust log) or clipping of the details in the dark and bright parts of the images (adjust sigmoid). Other than auto-contrast, methods including CLAHE, ACE, and grey world provided mostly positive improvements, but of lesser magnitudes than ARCR, MLLE, and MSRCR. While ARCR and MLLE, both designed specifically for visual restoration of underwater images, performed consistently well in our evaluation, the other underwater-specific method, the deep learning model FUnIE-GAN, had a mostly negative effect on detection accuracy, likely because it was trained on images of a much smaller resolution (\(256 \times 256\) pixels) than the resolution used by Faster R-CNN and YOLOv8. Finally, MSRCR, the best-performing method, effectively improved object detection despite sometimes producing visual artifacts such as halo effects or low colour consistency (Fig. 18 in Appendix A).

Fig. 8

Mean average precision (based on the Faster R-CNN model) as a function of the negative mean silhouette score (NMSS). The lines show a linear regression fit with the 95% confidence interval shown as the shaded area

Overall, the proposed framework identified and validated several image enhancement methods which can counteract domain shift in aquatic environments. MSRCR yielded the most accurate detections and was most consistent across the three datasets and the two algorithms. Figure 7 shows several example images for which the effect of MSRCR pre-processing resulted in an improved detection accuracy.

Fig. 9

A heatmap illustrating the correlation between image quality measures and mAP\(_{50}\) across the 14 image processing and enhancement methods. For each method, a mean image quality was calculated across all images and a mean mAP\(_{50}\) across the two detection algorithms

4.3 Analysis of the effect of image enhancement on domain-associated visual features

Plotting the first two principal components of raw and MSRCR-processed images revealed that MSRCR diminished the domain-associated visual features to some degree (Fig. 4). We quantified this effect across all image processing and enhancement methods using the negative mean silhouette score (NMSS). Strikingly, NMSS was correlated with mAP\(_{50}\) across the three datasets and the two detection algorithms (r of 0.43–0.86) with MSRCR yielding the highest NMSS values (Fig. 8). Increasing dimensionality of the PCA projections to five and ten principal components confirmed these results. This demonstrates that MSRCR and other image enhancement methods diminish domain-associated visual features, which is likely the reason for the observed detection accuracy improvements. The correlation between the NMSS measure and the detection accuracy under domain shift could be exploited for optimising or developing new underwater image enhancement methods.

In addition to NMSS, we used several established image quality measures to quantify the relationship between the visual features and detection accuracy. Only NMSS yielded a strong, positive correlation between image quality and mAP\(_{50}\) values for all three datasets (Fig. 9). NIQE, SSIM, and PSNR showed little or negative correlation with mAP\(_{50}\). UIQM, designed to correlate with human perception of underwater image quality, did not show a consistent correlation pattern. Decomposing UIQM into its three constituent scores revealed that each dataset’s mAP\(_{50}\) values were correlated with a different aspect of image quality: sharpness, colourfulness, and contrast for DeepFish, MBEEC-Low-Vis, and Jellytoring, respectively. Overall, the domain generalization performance of the image processing and enhancement methods did not correlate with measures quantifying the perceived image quality, but with NMSS, which quantifies the presence of domain-associated visual features by assessing the separability of domains in a dimensionality-reduced space of principal components.

To demonstrate that the mAP\(_{50}\) improvements were largely due to image enhancement mitigating domain shift, the object detectors were subjected to the random splitting evaluation. None of the image processing and enhancement methods, including MSRCR (Fig. 10), yielded an improvement greater than 0.5 p.p. on any of the three datasets (Tables 6 and 7 and Figs. 15 and 16 in Appendix A). This discrepancy in \(\Delta \text {mAP}_{50}\) between the domain-based and random splitting evaluations further supports the notion that MSRCR and other image enhancement methods improve underwater domain generalization by reducing domain-associated visual features. That is, under the random splitting evaluation, in which all domains are included in the training set, reducing domain-associated visual features does not improve accuracy. This contrasts with the domain-based splitting evaluation, in which the reduction of domain-associated visual features lessens the domain shift between the training and test sets, leading to improved detection accuracy.

Fig. 10

Precision-recall curves comparing detection accuracy with and without MSRCR pre-processing evaluated with the random splitting cross-validation

Fig. 11

Mean average precision of the YOLOv8 model (y-axis on the left) and MSRCR processing time in milliseconds (y-axis on the right) across a range of \(\sigma \) configurations

4.4 Evaluation of different MSRCR configurations

We evaluated several MSRCR configurations regarding sensitivity to the number of scales and the \(\sigma \) parameters. On the discovery (DeepFish) dataset using YOLOv8, all tested configurations resulted in mAP\(_{50}\) values greater than those obtained using raw images with no processing. The configuration with the best mAP\(_{50}\) performance was the default MSRCR with \(\sigma \) values of 15, 80, and 250 (Fig. 11). Using three scales performed better than using one or two scales, and using two scales performed better than a single scale as long as a scale extracting fine (local) details was included. The importance of the local details was highlighted by the relatively high detection accuracy achieved with a single-scale retinex when \(\sigma \) was set to 15. Upon inspecting a number of images, we found that this configuration produced unnatural visual effects due to a high increase in locally-limited contrast (Fig. 19 in Appendix A). We speculate that the local contrast aided in distinguishing object shapes, thereby improving detection accuracy, despite the low level of perceived visual quality. Conversely, configurations that did not include a scale for extracting local features resulted in soft-looking images and lower detection accuracy, with single-scale retinex \(\sigma =80\) and \(\sigma =250\) being the worst performing of the ten tested configurations.

The number of scales and the \(\sigma \) parameters were both associated with MSRCR’s running time. While large values of \(\sigma \) would generally result in increased running time due to the higher number of neighbouring pixels being processed at each position, the relationship with \(\sigma \) was not linear because our implementation resizes the inputs for the Gaussian blur filter whenever the corresponding kernel size exceeds 169 pixels. Thus, configurations with very large \(\sigma \) values, which operate on heavily downscaled inputs, had shorter running times than configurations with moderate \(\sigma \) values (Fig. 11). This effect was diminished when we employed the GPU-accelerated implementation, averaging at 8, 9, and 10 ms for the single-, two-, and three-scale retinex configurations, respectively. The GPU implementation could be further improved by batching inputs, for example when processing videos from multiple cameras, with three-scale retinex achieving speeds of 8 ms per frame. In summary, MSRCR with its robust performance across different configurations and fast GPU-accelerated implementation could be used in a range of real-time underwater object detection applications.

4.5 Comparison of image enhancement with model-centric domain generalization methods

Evaluation of MSRCR image pre-processing using the S-UODAC dataset resulted in mAP\(_{50}\) values of 61.3% and 63.1% when coupled with YOLOv3 and Faster R-CNN, respectively, which represents respective improvements of 23.1 and 9.4 p.p. compared to raw images with no processing (Table 5). Relative to the best model-centric approach (DMCL), MSRCR improved the mAP\(_{50}\) values by 8.0 and 1.7 p.p. for YOLOv3 and Faster R-CNN, respectively. Figure 12 shows the effect of MSRCR pre-processing for an example image from the S-UODAC dataset. This result further demonstrates the ability of our framework to identify image enhancement methods that can robustly and consistently aid domain generalization in underwater object detection. Given the orthogonality of our data-centric and the existing model-centric approaches, combining the two could result in further improvements to detection accuracy.

Table 5 Comparison with related work using the S-UODAC dataset
Fig. 12

Example detections from the S-UODAC dataset. Ground truth (a) with the different species highlighted with bounding boxes of different colours. Detections with a model trained with no image processing (b). Detections with a model trained with MSRCR pre-processing (c). Correct detections match location and colour of the ground truth bounding boxes

5 Conclusions

Underwater environments are highly variable with domain shifts caused by variable turbidity, colour casts, and light conditions occurring rapidly even when models are deployed at a single geographical location. Therefore, it is imperative to develop and deploy models robust to domain shifts. Here we developed a data-centric framework for testing domain generalization in underwater object detection with a robust and comprehensive cross-validation approach. We used this framework to demonstrate that there is a large difference between within-domain and out-of-domain prediction performance of two widely used off-the-shelf object detectors. We hypothesised that the visual differences associated with the domain shift can be counteracted with an image pre-processing step. Applying the proposed framework to test 14 image processing and enhancement methods revealed that while some of the methods could aid domain generalization, others had negligible effects or even lowered the detection accuracy compared to using raw images with no processing. We proposed that a silhouette-inspired score (NMSS) could quantify the presence of domain-associated visual features and found a correlation between this score and out-of-domain detection accuracy. Thus, NMSS could be used in designing and optimising image enhancement methods for combating domain shift inherent to aquatic environments.

MSRCR was the best performing image enhancement method in our out-of-domain evaluation, consistently improving mAP\(_{50}\) by 3.2–13.4 p.p. across the three real-world aquatic datasets. This contrasts with previous studies which focused on applying image enhancement to the within-domain underwater object detection setting and found only a limited degree of improvement [59, 100]. We further validated our results in a comparison with existing work on an independent test set, which highlighted that detections with MSRCR-based image pre-processing were more accurate by 1.7–8.0 p.p. (mAP\(_{50}\)) than those of the model-centric domain generalization methods.

Because domain shift can occur rapidly in underwater environments, improving domain generalization has significant practical implications across several fields. In aquaculture, object detectors robust to domain shift can improve the accuracy of monitoring fish behaviour and welfare, supporting better fish health management [68]. In environmental and ecological monitoring, robust and reliable detection of a variety of species can assist conservation efforts [16]. Similarly, for underwater exploration with autonomous vehicles, robust detection methods can adapt to diverse underwater environments, enhancing navigation and data collection even with fluctuating light and water conditions [6].

There are several limitations in the scope of our current work. First, most of the 14 image processing and enhancement methods have parameters that could be optimised to achieve better performance, but for practical reasons we used default parameters for all methods. To mitigate the potential problem of default values giving some methods an unfair advantage, we used three diverse datasets to confirm the robustness of our findings. Second, we used the publicly available weights of the FUnIE-GAN model, which was trained on images of a considerably smaller resolution than the resolution used by the two detection algorithms. This could have limited FUnIE-GAN’s performance in our evaluation. Third, while we included both two-stage and single-stage object detection algorithms, these were based on the CNN architecture. Validating our findings with a transformer-based detection algorithm would provide further evidence of the versatility of the proposed approach.

To conclude, the proposed framework allowed us to demonstrate the importance of out-of-domain evaluation in underwater object detection and the significant contribution of image enhancement to combating domain shift. Key ingredients of our framework when applied to underwater object detection are domain-based splitting cross-validation, distinct discovery and replication datasets, and randomly seeded replicates. While here we considered 14 image enhancement methods, three datasets, and two object detection algorithms, the proposed framework provides guidelines to conduct other types of empirical evaluation studies in a robust and reproducible fashion. In future work, we plan to explore if online optimisation of the MSRCR’s parameters [29] could further improve the detection accuracy and experiment with combining our data-centric approach with model-centric methods for domain generalization.