1 Introduction

Water quality in marine or freshwater environments such as rivers or lakes can be estimated through diatom identification. Numerous studies support that biological indices based on these species help to determine the ecological status of the water in these environments. Nowadays, this procedure is carried out manually, which is a time-consuming and challenging task: expert taxonomists observe preparations of water samples through optical microscopes to identify and quantify the diatom species. Thus, automatic tools based on computer vision and machine learning techniques are needed to perform this task.

Some recent works have dealt with automatic diatom classification. These approaches try to predict the correct taxon name from image samples containing a single diatom. Some classifiers based on general handcrafted features are capable of obtaining around 98% accuracy [3], although novel techniques based on convolutional neural networks (CNN) achieve better results, above 99% accuracy [11]. However, it is common that several diatoms of different taxa, sizes and shapes appear in a single field of view (FoV), along with other elements such as fragments or debris. In these cases, object detection or segmentation techniques are needed to locate all the regions of interest (ROI) present in the image, i.e., diatom shells.

A recent review of phytoplankton image segmentation methods is presented in Tang et al. [16]. Most of them are based on classical techniques such as region-based segmentation [14, 18], filtering [8] and active contours (AC) [6]. As far as the authors know, only two works use deep neural networks for diatom segmentation ([16] and [12]).

The performance of the classical methods above ranges from 88% to 95% accuracy. Their main drawbacks are sensitivity to noise, in the case of region-based segmentation, and the need to manually set the initial curve, in the case of AC. Moreover, most of them have been demonstrated only on a single taxon and on images containing a single diatom. Only the work of Zheng et al. [17] and previous work by the authors [8] were applied to multi-taxon images. However, the work of Zheng et al. [17] was evaluated for a single taxon only, with an average precision of 0.91 and a sensitivity of 0.81, while the work by Libreros et al. [8] is sensitive to image noise and to the features of the ROIs.

Object detection algorithms based on deep learning have been tested on diatoms in previous work by the authors [12], using a Region-based Convolutional Neural Network (R-CNN) [7] and the Darknet framework with the YOLO method. In R-CNN, the first step is to provide region proposals; based on these proposals, a CNN extracts image features that are then classified by a Support Vector Machine (SVM). In YOLO, a single neural network is applied to the whole image. The network divides the image into regions and predicts the class and the bounding box probabilities.

YOLO gives better results than R-CNN in the evaluation carried out with 10 taxa on full microscopic images containing multiple diatoms [12]. This is because the model has information about the global context, since the network is fed with the full image. Thus, R-CNN obtains an average F1-measure of 0.37, with 0.29 precision and 0.68 sensitivity, against an average F1-measure of 0.78, with 0.73 precision and 0.85 sensitivity, obtained with YOLO. The main problem with these methods is that they do not separate the ROIs properly when overlap occurs. Therefore, the quantification of diatoms is limited.

In this work, a complete comparison of several detection and segmentation frameworks has been carried out to detect and quantify diatoms of 10 different taxa. This is the first time that Viola-Jones and Semantic Segmentation techniques are used and compared for diatom segmentation in microscopic images containing several taxon shells. The paper is organized as follows. In Sect. 2, image acquisition is described. The tested methods are described in Sect. 3, and the results obtained, together with the evaluation metrics used, are summarized in Sect. 4. Finally, conclusions are drawn in Sect. 5.

2 Materials

A dataset with enough samples to train such a resource-demanding technique as deep learning is needed. For this reason, an extensive process of data collection, labeling and processing has been performed. The complete dataset is available upon request.

2.1 Data Acquisition and Labeling

The first step is to capture images of real samples of diatoms in similar conditions as they are observed under the microscope. Therefore, it is essential to recruit expert taxonomists. In our work, the taxonomist was responsible for collecting a large number of microscopic diatom images and performing the manual identification task. Thus, 126 diatom images including 10 different taxa have been used, with variety in terms of specific features (length, internal and external shape) and diatom concentration. All the images have been taken with a Brunel SP30 microscope, using a 60x objective and an image resolution of 2592 \(\times \) 1944 pixels.

The experts were provided with a labeling tool so that they were able to manually label 1446 diatoms from the collected dataset, that is, an average of 144 ROIs per taxon. There are many free labeling tools widely used to help in this task; VGG Image Annotator (VIA) [5] was selected in our case. VIA is an HTML file that can be opened in any standard web browser. The graphical user interface is friendly and easy to use: once the images have been imported, the user only has to select the region and mark the points around the diatom shape. Finally, all the information can be stored in a JSON file to compose the ground truth (GT) dataset.
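Since VIA stores each polygon as lists of x and y vertex coordinates, the annotations can be converted into binary masks for training and evaluation. The following is a minimal sketch of that conversion, assuming the standard VIA 2.x polygon export format; the file name and helper function are illustrative, not part of the original pipeline.

```python
# Sketch: convert a VIA polygon export (JSON) into binary GT masks.
# Assumes the VIA 2.x annotation format, where each region stores a
# polygon as "all_points_x"/"all_points_y"; the file name is hypothetical.
import json
import numpy as np
from skimage.draw import polygon

def via_json_to_masks(json_path, height=1944, width=2592):
    """Yield (filename, binary mask) pairs, one mask per annotated image."""
    with open(json_path) as f:
        annotations = json.load(f)
    for entry in annotations.values():
        mask = np.zeros((height, width), dtype=np.uint8)
        for region in entry.get("regions", []):
            shape = region["shape_attributes"]
            if shape.get("name") != "polygon":
                continue
            rr, cc = polygon(shape["all_points_y"], shape["all_points_x"],
                             shape=mask.shape)
            mask[rr, cc] = 1  # all pixels inside the diatom outline
        yield entry["filename"], mask
```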

3 Methods

3.1 Viola and Jones

Viola-Jones is a classical detection method where the features to be learned by the detector are based on the Histogram of Oriented Gradients (HOG) [4]. This descriptor computes the gradient (direction and magnitude) of the pixels in a region and groups them into histograms. It is useful to detect shapes and contours, which are among the most prominent features for diatom characterization. However, it is not invariant to object orientation; this problem can be addressed with augmentation techniques such as rotations and flips. A series of weak classifiers is trained in cascade to detect instances at different scales and positions in the image using a sliding-window approach. A general scheme of the method is presented in Fig. 1.
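As an illustration of this mechanism, the following minimal sketch computes HOG features over a sliding window and applies a placeholder cascade; the window size, step and the `cascade_predict` function are assumptions for illustration, not the actual trained detector.

```python
# Sketch: HOG features + sliding-window scan, as used by a cascade detector.
# `cascade_predict` stands in for the trained cascade stages and is
# hypothetical; window size and step are illustrative values.
import numpy as np
from skimage.feature import hog

def sliding_window_detect(image, cascade_predict, win=(128, 128), step=32):
    """Return candidate bounding boxes (row, col, height, width)."""
    boxes = []
    rows, cols = image.shape
    for r in range(0, rows - win[0] + 1, step):
        for c in range(0, cols - win[1] + 1, step):
            patch = image[r:r + win[0], c:c + win[1]]
            feats = hog(patch, orientations=9,
                        pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            # Each cascade stage quickly rejects most windows; only windows
            # accepted by every stage survive as detections.
            if cascade_predict(feats):
                boxes.append((r, c, win[0], win[1]))
    return boxes
```

In the real detector this scan is repeated at several window scales, and overlapping detections are merged (see the scale factor and merge threshold parameters described below).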

Fig. 1. Cascade object detector based on the Viola-Jones algorithm.

The development of this method is organized into three steps:

1. Negative generation. To improve the performance of the method, a proper set of negative images is built. To achieve this, the typical appearance of the background in diatom slides was studied: it usually shows a small range of gray values, without any defined line. Taking this into account, an algorithm was developed to automatically generate random images that follow this appearance (a sketch of such a generator is given after this list).

2. Training process. The parameters provided to customize the learning process are: the object training size, the negative samples factor, the number of stages, the false alarm rate, and the true positive rate.

3. Model testing. The detection parameters are customized to prevent over-detection of artifacts, which is the main drawback of this technique. These parameters are:

   – Minimum and maximum bounding box size. The minimum and maximum object size that is expected to be found in the images.

   – Scale factor. Determines how much the sliding window increases its size at each iteration.

   – Merge threshold. One of the most important parameters, since it is used to tune the detection/false-alarm ratio: according to this overlap measure, contiguous detections are joined.
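As referenced in step 1, a minimal sketch of the synthetic negative generator is shown below, under the assumption that a plausible background patch is low-contrast gray noise with no defined edges; the gray range, noise level and blur radius are illustrative values, not those of the original implementation.

```python
# Sketch: generate a random background-like negative image (narrow range of
# gray values, no defined lines). All numeric ranges are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def random_negative(height=256, width=256, rng=None):
    rng = rng or np.random.default_rng()
    base = rng.integers(140, 200)                  # random mean gray level
    noise = rng.normal(0.0, 6.0, (height, width))  # narrow gray range
    img = gaussian_filter(base + noise, sigma=3)   # smooth out any structure
    return np.clip(img, 0, 255).astype(np.uint8)
```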

3.2 Scale and Curvature Invariant Ridge Detector

Scale and Curvature Invariant Ridge Detector (SCIRD) is based on a bank of Gaussian non-linear transformation filters and was originally presented for segmenting dendritic trees and corneal nerve fibers [1]. In the context of detecting diatoms in water resources, the SCIRD filter bank is applied to the diatom images, followed by a post-processing step that segments the structures related to diatoms, using the following equation:

$$\begin{aligned} F(x;\sigma ;k)=\frac{1}{\sigma _{2}^2}\left[ \frac{(x_{2}+kx_{1}^2)^2}{\sigma _{2}^2}-1 \right] \exp \left[ -\frac{x_{1}^2}{2\sigma _{1}^2}-\frac{(x_{2}+kx_{1}^2)^2}{2\sigma _{2}^2}\right] \end{aligned}$$
(1)

where \( (x_{1}, x_{2})\) represents a point in the image coordinate system, k is a shape parameter and \( \sigma = (\sigma _{1},\sigma _{2}) \) corresponds to the standard deviations of the Gaussian distribution along each coordinate direction. k, \(\sigma _{1}\) and \(\sigma _{2}\) are parameters provided by the user. From Eq. 1 it is possible to obtain a set of pre-defined filter banks by spanning a set of values for these parameters.

After generating the filter bank through the non-linear transformation, the image is convolved with each filter in the set. The final result is obtained as the maximum value at each pixel position among the convolution results. Each filter represents a different shape depending on the \( \sigma = (\sigma _{1},\sigma _{2}) \) values. The \( \sigma \) parameters must be adjusted according to image contrast, debris concentration and noise level [8].
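The following minimal sketch implements Eq. 1 and the max-response filtering described above; the parameter grids for \(\sigma _{1}\), \(\sigma _{2}\) and k and the kernel support are illustrative assumptions, since in practice they are tuned to image contrast and noise as discussed in [8].

```python
# Sketch: SCIRD filter bank from Eq. 1 plus per-pixel maximum response.
# Parameter grids and kernel half-width are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def scird_kernel(sigma1, sigma2, k, half=10):
    x1, x2 = np.meshgrid(np.arange(-half, half + 1),
                         np.arange(-half, half + 1), indexing="ij")
    u = x2 + k * x1**2                       # curved coordinate of Eq. 1
    return (1.0 / sigma2**2) * (u**2 / sigma2**2 - 1.0) * \
        np.exp(-x1**2 / (2 * sigma1**2) - u**2 / (2 * sigma2**2))

def scird_response(image):
    bank = [scird_kernel(s1, s2, k)
            for s1 in (2, 3, 4) for s2 in (1, 2) for k in (-0.1, 0.0, 0.1)]
    responses = [fftconvolve(image, f, mode="same") for f in bank]
    return np.max(responses, axis=0)         # strongest filter per pixel
```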

3.3 YOLO

This framework is based on a fully convolutional network, a different approach from the traditional R-CNN and its family (Fast/Faster R-CNN), that reduces the running time by several orders of magnitude [13]. The R-CNN family applies the model to the input image at multiple locations and scales, whereas YOLO applies a single neural network to the whole input image, once. This network is responsible for dividing the image into candidate regions (instead of relying on additional region proposal algorithms that add extra cost). Additionally, as the whole image is fed to the network (instead of several patches), the model has global information about the context of the object, making it more suitable for accurate decisions.

In detail, this framework divides the image into a grid of cells, and each cell is responsible for proposing a fixed number of candidate regions. Then, each box is adjusted according to a predicted offset, so that it fits the size and location of a candidate object. Along with that, the prediction also provides a confidence score for the object class and for the likelihood of the bounding box containing an object. The result is that myriads of boxes are proposed, but most of them have very low confidence, so they can easily be rejected using a threshold. A general scheme of the method is presented in Fig. 2.
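The rejection step described above can be illustrated with a short sketch: boxes below a confidence threshold are discarded, and overlapping survivors are merged by non-maximum suppression. The threshold values are illustrative assumptions, not those used by the framework.

```python
# Sketch: confidence thresholding + non-maximum suppression (NMS) over the
# boxes proposed by the network. Threshold values are illustrative.
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def filter_boxes(boxes, scores, conf_thr=0.25, iou_thr=0.45):
    """Keep confident boxes; suppress lower-scored overlapping ones."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thr]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```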

The training procedure was configured with a learning rate of 0.001 and 10000 epochs. The selected optimizer was Stochastic Gradient Descent with a momentum coefficient of 0.9. The complete dataset was divided into a training subset of 105 images and a validation subset with the remaining 21 images. The mini-batch size was set to 4 images. After the training stage, the model performance was evaluated against the ground truth.

Fig. 2. YOLO architecture flowchart.

3.4 SegNet

SegNet is a Semantic Segmentation architecture originally designed for scene understanding applications, such as autonomous driving, where efficiency and speed at inference time are crucial. Some of the first deep learning Semantic Segmentation models tried to directly apply the deep neural network architectures designed for image classification to pixel-level classification. However, the convolution, pooling and sub-sampling operations performed by CNNs may reduce the feature map resolution, losing spatial information that is essential for good boundary delimitation. To solve this, novel approaches emerged, such as Fully Convolutional Networks (FCNs) [9], DeconvNet [10], U-Net [15] or SegNet [2]. These models share a similar architecture, with slight differences. In this paper, SegNet is selected, due to its good accuracy and its efficiency in terms of memory and computational time.

The architecture of SegNet is formed by an encoder network, a corresponding decoder network and a final pixel-level classification layer. The encoder network consists of the first 13 layers of the popular VGG16 network, pretrained on a large dataset such as ImageNet or COCO. These layers are a combination of convolution, batch normalization, ReLU and max-pooling operations which generate the feature maps. As mentioned above, convolution and pooling operations reduce the feature map resolution, affecting the final segmentation accuracy. In SegNet, the fully connected layers of VGG16 are replaced by a decoder network (one decoder for each encoder), which is responsible for upsampling the input feature maps back to a higher resolution. To achieve this, the indices of each max-pooling layer (the positions of the maximum feature values) are stored at the encoding stage to capture the spatial information, and, at the decoding stage, these indices are used to perform the upsampling. Finally, the output of the decoding stage (the high resolution feature maps) is fed to a softmax layer, which carries out the pixel-level classification. A general scheme of the method is presented in Fig. 3.
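The index-based upsampling that distinguishes SegNet can be shown with a single encoder/decoder pair; this minimal PyTorch sketch stands in for the full 13-layer network and uses the standard MaxPool2d/MaxUnpool2d modules.

```python
# Sketch: SegNet's key mechanism in PyTorch. The max-pooling indices stored
# at the encoder are reused by the decoder to place values back at the
# positions of the original maxima. One pool/unpool pair stands in for the
# full VGG16-based encoder-decoder.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 360, 480)       # an encoder feature map
pooled, indices = pool(x)              # downsample, remember max positions
# ... decoder convolutions would densify the feature map here ...
up = unpool(pooled, indices)           # sparse map, maxima restored in place
print(up.shape)                        # torch.Size([1, 64, 360, 480])
```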

At the training stage, the specific diatom dataset is used to adapt the pre-trained COCO network weights to our problem. Data augmentation is used to increase the size and quality of the original dataset. This process is done by applying different image processing operations to the original images, such as rotations, translations, crops, mirror effects, Gaussian noise, and contrast enhancements.
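A minimal sketch of such an augmentation pipeline is given below, using basic scikit-image and NumPy operations; the angle range, noise level and gamma range are illustrative values. Note that, for Semantic Segmentation, the same geometric transforms must also be applied to the GT masks.

```python
# Sketch: simple augmentation of a grayscale image scaled to [0, 1].
# Rotation, mirror, Gaussian noise and contrast (gamma); all ranges
# are illustrative.
import numpy as np
from skimage.transform import rotate
from skimage.exposure import adjust_gamma

def augment(image, rng=None):
    rng = rng or np.random.default_rng()
    out = rotate(image, angle=rng.uniform(-15, 15), mode="reflect")
    if rng.random() < 0.5:
        out = np.fliplr(out)                          # mirror effect
    out = out + rng.normal(0.0, 0.01, out.shape)      # Gaussian noise
    out = np.clip(out, 0.0, 1.0)
    return adjust_gamma(out, gamma=rng.uniform(0.8, 1.2))  # contrast change
```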

The training procedure was configured with a learning rate of 0.05 and 100 epochs. The selected optimizer was Stochastic Gradient Descent with a momentum coefficient of 0.9. The complete dataset was divided into a training subset of 105 images and a validation subset with the remaining 21 images. The images were resized to 480 \(\times \) 360, preserving the aspect ratio, to allow a mini-batch size of 4 images. After the training stage, the model performance was evaluated using the ground truth masks.

Fig. 3. SegNet architecture flowchart.

4 Results

The metrics used to measure the pixel-wise detection performance are the following (a minimal sketch computing them from binary masks is given after the list):

• Sensitivity: The sensitivity, or recall, can be measured in terms of true positives (TP) and false negatives (FN) at the pixel level, following Eq. 2. TP pixels are those that belong to the class and are predicted as positive. FN pixels, also known as type 2 errors, are those that belong to the class although they are predicted as negative. This metric gives the proportion of correctly classified positives.

    $$\begin{aligned} \text {Sensitivity} = \frac{\text {TP}}{\text {TP} + \text {FN}} \end{aligned}$$
    (2)
• Precision: Similar to the previous metric, although taking into account false positives (FP) instead of FN (Eq. 3). FP pixels (type 1 errors) are those that do not belong to the ROI although they are predicted as positive. This metric gives the probability that a positive prediction is a correct detection.

    $$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {TP} + \text {FP}} \end{aligned}$$
    (3)
• Specificity: This metric gives the proportion of correctly detected negatives and follows Eq. 4. True negative (TN) pixels are those that do not belong to the ROI and are predicted as negative.

    $$\begin{aligned} \text {Specificity} = \frac{\text {TN}}{\text {TN} + \text {FP}} \end{aligned}$$
    (4)
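As referenced above, the three metrics can be computed directly from a predicted binary mask and its GT mask; the following is a minimal sketch of that computation.

```python
# Sketch: pixel-wise sensitivity, precision and specificity (Eqs. 2-4)
# from a predicted binary mask and the ground-truth mask.
import numpy as np

def pixel_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # ROI pixels predicted as ROI
    fp = np.sum(pred & ~gt)     # background predicted as ROI
    fn = np.sum(~pred & gt)     # ROI predicted as background
    tn = np.sum(~pred & ~gt)    # background predicted as background
    return {"sensitivity": tp / (tp + fn),
            "precision": tp / (tp + fp),
            "specificity": tn / (tn + fp)}
```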
Table 1. Pixel-wise detection results for each method

Once the models have been trained, their performance has been evaluated with the metrics explained above over the dataset predictions. In this case, the evaluation has been done by comparing the predicted bounding boxes with the GT dataset at the pixel level. For most species, the SegNet and YOLO methods show the best results; deep learning methods outperform the classical ones. Viola-Jones does not obtain good results for any taxa, although SCIRD obtains the best sensitivity values for some taxa. Precision is highest for the YOLO model, i.e., it produces the fewest FPs compared to the other methods; even so, the number of FPs remains high, and only an average precision of 0.73 is achieved. On the other hand, sensitivity is always higher for the SegNet framework, indicating a low number of FNs, with an average value of 0.96. Specificity is similar for the SCIRD and YOLO methods, with average values of 0.93 and 0.96, respectively. The average results per species for all the techniques are presented in Table 1.

Finally, the bounding boxes generated by all tested methods for ten sample images of the different taxa considered are illustrated in Figs. 4 and 5. Each row shows a sample image while each column shows the results of a detection method.

Fig. 4. Example of diatoms detected by each method (I). Each column shows the results of the segmentation methods: (\(1^{st}\)) Viola-Jones; (\(2^{nd}\)) YOLO; (\(3^{rd}\)) SCIRD and (\(4^{th}\)) SegNet.

Fig. 5. Example of diatoms detected by each method (II). Each column shows the results of the segmentation methods: (\(1^{st}\)) Viola-Jones; (\(2^{nd}\)) YOLO; (\(3^{rd}\)) SCIRD and (\(4^{th}\)) SegNet.

5 Conclusions

In this work, four different detection and segmentation frameworks have been applied to segment diatoms of 10 different taxa. This is a complex challenge due to the large variation within species and the slight differences between them. The methods comprise both classical approaches, namely Viola-Jones and SCIRD, and deep learning approaches. Two deep learning techniques have been considered: (i) object detection with YOLO and (ii) Semantic Segmentation, by means of pixel-wise binary classification, with SegNet.

This is the first time that Viola-Jones and Semantic Segmentation techniques are used and compared for multi-taxon shell segmentation, i.e., for detecting diatoms of different taxa in full FoV microscopic images.

The deep learning approaches, SegNet and YOLO, showed the best results in the tests carried out. This may be because the model has information about the global context, since the network is fed with the full image. The best sensitivity is obtained with SegNet, with an average value of 0.96 versus 0.85 for YOLO. However, the best specificity is obtained with YOLO, with an average value of 0.96; SCIRD also achieves good specificity, with an average value of 0.93. Precision drops for all methods, with the best value, 0.73, obtained by YOLO.

Sensitivity is improved with Semantic Segmentation as compared to previously published methods, but there is still room to improve precision. The main problem with these methods is that they do not separate the ROIs properly when overlapping occurs; therefore, the quantification of diatoms is limited. These promising results encourage us to continue working on this complex problem. The SegNet model performance could be improved by adding post-processing techniques, such as morphological operations, to separate the diatom instances correctly, as well as by exploring new architectures based on instance segmentation.