1 Introduction

Semen evaluation is a fundamental diagnostic tool for assessing male fertility. In humans, it is utilized in a range of clinical contexts, including assisted reproduction, post-vasectomy monitoring, and the detection of sexually transmitted infections (STIs) [1,2,3]. In the livestock industry, it is widely used as a tool for selection and breeding purposes [4,5,6].

Semen evaluation involves the assessment of various parameters by macroscopic and microscopic analysis of the samples. Macroscopic analysis measures characteristics such as semen composition, volume, liquefaction, viscosity, and pH. Microscopic evaluation involves the study of sperm clumping, motility, vitality, and cell count [7].

Automated systems to perform such microscopic tests have been developed and commercialized due to their high demand, for three reasons. Firstly, qualified experts are needed, but they are scarce. Secondly, several dozens of spermatozoa move simultaneously in a semen sample, making assessment a difficult and time-consuming task; this places a low upper limit on an expert’s throughput. Thirdly, subjectivity and the lack of a traceable standard have been identified as causes of the high level of uncertainty associated with manual sperm motility and morphology assessments [3, 8]. However, laboratories are reluctant to accept automated results until they have been validated against a reliable standard. In other words, the automated report must be compared to the human report, which is prone to a high level of uncertainty unless a great deal of expert effort and time is invested [9].

The arrival of deep learning has enabled computer vision systems to match or surpass human performance in multiple object detection and tracking, classification, recognition, segmentation and image generation in many fields [10,11,12,13,14,15,16]. Computer vision techniques have been applied in the specific context of microscopic semen analysis to morphological classification of sperm heads [17, 18], and to motility analysis. This process entails the detection and tracking of spermatozoa heads within highly concentrated samples [12, 19, 20]. Recent studies indicate that additional parameters, such as acrosome morphology (a cap-like structure on the sperm head) [21, 22] and beating patterns, may be important indicators of sperm health [23,24,25,26,27,28]. However, the aforementioned analyses necessitate images with a low concentration of spermatozoa. Furthermore, in the case of flagellum analysis, current state-of-the-art techniques also require the selection of the spermatozoon from the seminal fluid or its isolation in a video [23, 24, 29].

This paper presents a methodology that employs computer vision and deep learning techniques to provide unlimited, synthetic, video-realistic, fully labeled and on-demand video datasets that can be used for training, improving, and benchmarking computer-assisted sperm analysis (CASA) systems.

We propose a spline-based parametric model of a spermatozoon and use a Denoising Diffusion Probabilistic Model (DDPM) to animate it. To achieve this, we extract clean video sequences of isolated spermatozoa and model them according to the spline-based model. We then use the DDPM to learn the behavior of various classes of spermatozoa, which subsequently serves to generate new synthetic trajectories. Finally, we apply a cycle-consistent Generative Adversarial Network (CycleGAN) to perform domain transfer on the animated trajectories, thereby obtaining realistic synthetic video sequences. In addition to this method, we provide a new fully annotated synthetic dataset for motility analysis. This dataset comprises tracking and detection labels, spline parameters modeling the flagellum shape, and the motility category of each sperm cell.

2 Related works

In this section, we provide an overview of topics relevant to our case study, including recent proposals for CASA systems, available sperm datasets, techniques for generating synthetic sperm datasets, and applications of diffusion models in trajectory generation.

2.1 Computer-assisted sperm analysis

Computer-assisted sperm analysis (CASA) has been a research topic for a long time. According to [11], sperm analysis tools up to 2017 still delivered poor-quality trajectories for samples with high sperm concentration. Thus, during the last decade, new CASA proposals have leveraged the emergence of deep learning for classification, detection and segmentation. Here, we mention some of the most recent.

A CNN was used for morphology classification of sperm in [30]. Similarly, a classifier of healthy sperm given only sperm head images was presented in [18]. In the same line, [17] carried out morphology classification by training a CNN on three openly available sperm morphology datasets. Segmentation of sperm heads on the SCIAN-SpermGS dataset [31] was addressed in [15]. A sperm tracking method suitable for samples recorded by a smartphone as a portable and low-cost platform was proposed in [32]. MotilitiAI consists of a sperm analysis pipeline, including a tracker aimed at obtaining statistics of sperm movements and a classification model to predict the fertility of the sample [33]. DeepSperm was a real-time bull sperm detection network trained with a dataset manually annotated by two experts [19]. Finally, in [12] YOLO was utilized to detect spermatozoa.

2.2 Sperm datasets

Although there are many open datasets of sperm video sequences, there is no homogeneity in terms of dataset size, capture conditions, microscope magnifications, colors and tones, annotations, and even classes in the ground truth. For example, the HuSHeM dataset [34] consists of 216 sperm images with sperm heads labeled as normal, tapered, pyriform or amorphous, whereas the SMIDS dataset [30] collects up to 3000 images, which are only labeled as normal or abnormal. Below, we cite recent datasets to expose this diversity and lack of standardization.

  • VISEM dataset [35] contains videos of around 2 minutes from 85 human patients, each recorded with a \(400\times \) magnification. These videos are annotated with sperm motility, sperm concentration, total sperm count, ejaculated volume, sperm morphology and sperm vitality, along with data related to fatty acids, serum, sex hormones and anonymized participant information such as age. Thus, they are meant for classification or regression rather than object detection or tracking. To overcome this issue, VISEM-Tracking was released [36] with 20 videos of 30 seconds each.

  • SVIA sperm detection and tracking dataset [37] has 101 human sperm video sequences obtained with a WLJY-9000 CASA system at \(\times 20\) magnification with a \(\times 20\) electronic eyepiece. It was released split into three subsets: 1) object detection, with 3590 images and 125K annotated objects in three classes (sperm, debris or leukocyte); 2) sperm head segmentation and tracking, with 451 frames and 26K annotations; and 3) head morphology classification, with 125K images.

  • SeSVID dataset [38] contains 12 videos of human semen recorded under a \(100\times \) magnification lens. The purpose of this dataset is to support object detection tasks, providing 92,329 labeled objects (68,244 of them sperm) in 1175 images. Unlike other datasets, in which the ground-truth bounding boxes frame only the sperm heads, here they cover the entire head and tail.

  • SCIAN-SpermSegGS dataset [15] is specifically meant for segmenting the different parts of the sperm cell (head, acrosome, cell nucleus and midpiece).

Recent works highlight the importance of the movement of the flagellum and suggest that it should be taken into account in fertility and motility assessments [23,24,25,26,27,28].

2.3 Synthetic datasets

Two different approaches to synthetic data generation in the domain of sperm analysis have been proposed in the literature. The first one involves modeling the sperm using expert knowledge in the field, while the second is based on the use of generative AI techniques.

Modeling sperm morphology and motility aims to overcome the heterogeneity of the available datasets described above, while obtaining fully annotated datasets at the same time.

In the context of motility analysis, two approaches [39, 40] have been applied to generating video sequences of schematic spermatozoa, modeling the spatial movement of the head and flagellum. These approaches analyzed different kinds of spermatozoon movements [7] and developed a mathematical model of them. However, these models only partially capture the nonlinear complexity of sperm movement, which can be influenced by various movement regimes, flagellum elasticity, and medium viscosity [27]. Regarding the realism of the generated videos, in the best cases, predefined hand-made rules are applied over schematic frames, such as adding Gaussian blur and salt-and-pepper noise [39]. Additionally, other authors incorporated a background extracted from real videos and included floating particles, such as white circles and shadows, to enhance realism [41].

The second line of research focuses on data-driven techniques for generating synthetic images. Specifically, these works rely on Generative Adversarial Networks (GANs) [42] to augment a given dataset. In this context, GANs have been trained in [43,44,45] to augment the number of images from SMIDS, HuSHeM, SCIAN-MorphoSpermGS and MHSMA, as well as one private dataset [44]. In [13], a GAN with a capsule network architecture was used to augment and balance the HuSHeM dataset. However, all these GAN-based approaches focus on morphological analysis of the head only; none of them provide segmentation of the whole sperm (head + flagellum).

2.4 Denoising diffusion probabilistic models

Denoising Diffusion Probabilistic Models [46], diffusion models for short, have been shown to be more robust than earlier generative architectures, including Variational Autoencoders (VAEs) [47], GANs [42] and deep autoregressive models [48].

Their applications span a diverse range of tasks, such as conditional image generation [49], image-to-image translation [50], text-to-image generation [16], point cloud completion [51], natural language processing [52], and time series forecasting [53], among others [54].

The use of diffusion models for trajectory generation (related to the scope of this paper) was proposed in [55]. In this approach, trajectory generation is formulated as an inpainting process in which the diffusion model transforms a random trajectory into a feasible and consistent one between two known locations. Similar approaches have later been successfully applied to modeling vehicle trajectories [56] and animating humanoid avatars [57,58,59], but none of them have been used on spermatozoa data.

3 Sperm video generation

Sperm video generation refers to the process of delivering a fully annotated sequence of frames that mimics both the dynamics of the sperm flagellum over time and the look and feel of the original video in terms of lighting, noise, artifacts, etc.

To this end, we first identify the usable spermatozoa in a video sequence and then preprocess each one independently, resulting in a vectorized representation. The extracted information is then used to fit a parametric representation for each spermatozoon in each frame. Next, a model to generate new representations is learned from those collected from the real videos with a Denoising Diffusion Probabilistic Model. Finally, the generated representations are embedded into video frames, and a style transfer neural network renders the resulting schematic sperm on a pitch-black background into a realistic video sequence. The details of each step are given in the following sections.

Fig. 1

Top row: Process to obtain the parametric representation of a spermatozoon; (a) Neighborhood \(\mathcal {B}\) of a spermatozoon, (b) Rotation, (c) Spline fitting, (d) Velocity vector. Bottom row: Trajectory generation of a schematic spermatozoon; (e) insert the window into the frame and rotate it back, (f) add the velocity vector to the position, (g) position and rotation in the next frame

3.1 Spermatozoon windowing

By windowing, we refer to the process of extracting every single spermatozoon from a video sequence that complies with certain requirements, and processing it to deliver a cropped, segmented and rotated version of it and its neighborhood in a sequence of windows.

To begin with, a YOLO v5 [10] detection network is trained on a previously hand-annotated dataset to locate the head of each sperm cell in every frame. In addition, in order to preserve the identities of the detections, all of them are tracked throughout the video sequence using Kalman filtering. Thus, the result is a collection of N detections that we denote as \(\{\delta ^{(i)}\}\), for \(i=1\ldots N\); and such that

$$ \delta ^{(i)} \!=\! \left[ (x_{1}^{(i)}, y_{1}^{(i)}), (x_{2}^{(i)}, y_{2}^{(i)}), \ldots , (x_{T}^{(i)}, y_{T}^{(i)}), \texttt{remove}^{(i)}\!\right] , $$

where \((x_{t}^{(i)}, y_{t}^{(i)}),\) with \({t=1,\ldots ,T},\) is the center of the bounding box with respect to the t-th frame and \(\texttt{remove}^{(i)}\) is a boolean flag for the i-th detection initialized to False. In other words, we use the term “i-th spermatozoon” to refer to the collection \(\delta ^{(i)}\) that uniquely represents the position of the spermatozoon with identity i along the video sequence of T frames, along with the boolean flag.

In sperm videos, scenes are cluttered with frequent crossovers, clustering and splitting. The method proposed in this paper requires spermatozoa whose neighborhood is free of other spermatozoa during the whole sequence. Let \(\mathcal {B}_t\big ((x_{t}^{(i)}, y_{t}^{(i)}); r\big )\), a ball of radius r centered at \((x_{t}^{(i)}, y_{t}^{(i)})\), be the neighborhood of the i-th spermatozoon in the t-th frame of the video. If there exists a j-th spermatozoon at the same frame t such that \((x_t^{(j)}, y_t^{(j)}) \in \mathcal {B}_t\), then \(\texttt {remove}^{(i)} =\) True. After checking all the possible crossings \((\delta ^{(i)}, \delta ^{(j)})\), we filter out those with the flag remove = True; the remaining detections are the usable ones referred to above.

Let \(\zeta ^{(i)} = \left[ (x_{1}^{(i)}, y_{1}^{(i)}), (x_{2}^{(i)}, y_{2}^{(i)}), \ldots , (x_{T}^{(i)}, y_{T}^{(i)})\right] \), for \(i = 1, \ldots , n\), be the usable spermatozoa detections, reindexed from 1 up to \(n\le N\) and without the flag remove, because it is no longer needed. Such a reindexing does not affect the terminology introduced, so in the following the “i-th spermatozoon” is \(\zeta ^{(i)}\).
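A minimal sketch of this filtering step is given below; the container `detections` (one list of per-frame head coordinates per track, i.e. the \(\delta ^{(i)}\) above) and the function name are illustrative choices, not part of the original pipeline.

```python
import math

def filter_usable(detections, r):
    """Drop every track delta^(i) that has a neighbor inside the ball of
    radius r in any frame; the survivors are the usable zeta^(i)."""
    N = len(detections)
    T = len(detections[0])
    remove = [False] * N
    for t in range(T):
        for i in range(N):
            for j in range(i + 1, N):
                xi, yi = detections[i][t]
                xj, yj = detections[j][t]
                if math.hypot(xi - xj, yi - yj) < r:  # (x_j, y_j) in B_t
                    remove[i] = remove[j] = True
    # reindexing: keep only the tracks whose flag is still False
    return [d for d, flag in zip(detections, remove) if not flag]
```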

Thus, for each \(\zeta ^{(i)}\) and each t, the following steps are performed:

  1.

    Crop a square window centered at \((x_{t}^{(i)}, y_{t}^{(i)})\) of size \(S\times S\). Hence, the center of the window is locally (i.e. within the window) referenced as (0, 0) but globally (i.e. within the video frame) as \((x_{t}^{(i)}, y_{t}^{(i)})\).

  2.

    Segment the spermatozoon in the window. To this end, adaptive thresholding, dilation and erosion operations are used.

  3.

    Fit an ellipse to the full spermatozoon using the Fitzgibbon algorithm [60].

  4.

    Rotate the window so that the major axis of the ellipse becomes horizontal and the spermatozoon is heading to the right. Let \(\alpha \) be the angle of that rotation, then we define a unit vector \(\vec {\alpha } = (\alpha _x, \alpha _y)\) such that

    $$\begin{aligned} \alpha _x = \frac{1}{\sqrt{1+\tan ^2 \alpha }} \quad \text {and}\quad \alpha _y = \frac{\tan \alpha }{\sqrt{1+\tan ^2 \alpha }}. \end{aligned}$$
    (1)

    We adopt this vector to facilitate further learning and inference processes.

These steps are depicted in Fig. 1(a)-(c), together with other information described in the next subsection. Notice that the center of the window is the center of the detection, but not necessarily the center of the head.
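A hedged sketch of steps 1-4 with OpenCV (version 4 API assumed) follows; the thresholding block size, the morphology kernel and the rotation sign convention are our assumptions for illustration, and the additional flip that makes the spermatozoon head point right is omitted.

```python
import cv2
import numpy as np

def window_and_rotate(frame_gray, cx, cy, S=140):
    """cx, cy: integer head-center coordinates, assumed far enough from the
    frame border for the full S x S window to fit."""
    h = S // 2
    win = frame_gray[cy - h:cy + h, cx - h:cx + h]       # 1. crop S x S window
    mask = cv2.adaptiveThreshold(win, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY_INV, 31, 5)
    k = np.ones((3, 3), np.uint8)                        # 2. segment (dilate + erode)
    mask = cv2.erode(cv2.dilate(mask, k), k)
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    (ex, ey), axes, ang = cv2.fitEllipse(max(cnts, key=cv2.contourArea))  # 3.
    M = cv2.getRotationMatrix2D((h, h), ang - 90, 1.0)   # 4. major axis -> horizontal
    win_rot = cv2.warpAffine(win, M, (S, S))
    t = np.tan(np.deg2rad(ang - 90))                     # unit vector of (1)
    alpha_vec = np.array([1.0, t]) / np.sqrt(1.0 + t * t)
    return win_rot, mask, alpha_vec
```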

3.2 Spermatozoon parametric representation

We propose a parametric representation of the i-th spermatozoon in frame t in terms of its head, flagellum, body orientation and velocity. For the sake of clarity, we omit superscript (i) and subscript t in all the parameters unless needed.

  • Head: For the sake of simplicity, the head is just a circle of the same size in all the windows, locally centered at (0, 0) and globally at (x, y).

  • Flagellum: We propose to model the flagellum as a fourth-order Bézier spline, given by:

    $$\begin{aligned} B(\lambda ; \mathcal {P}^{(4)}) = \sum _{k=0}^4 \left( {\begin{array}{c}4\\ k\end{array}}\right) (1 - \lambda )^{4-k} \lambda ^k \textbf{P}_k, \end{aligned}$$
    (2)

    where \(\mathcal {P}^{(N)} = \{\textbf{P}_k\}\), for \({k=0,\ldots ,N}\) and \(\textbf{P}_k \in \mathbb {R}^2\), are the control points; and \(\lambda \) generates the curve as it goes from 0 to 1. \(\textbf{P}_0\) and \(\textbf{P}_4\) are set to the head and the end of the tail respectively, while \(\textbf{P}_1, \textbf{P}_2\) and \(\textbf{P}_3\) are fitted by means of the Nelder-Mead algorithm (see the sketch after this list). Control points are depicted in Fig. 1(c).

  • Body: We use the unit vector \(\vec {\alpha } = [\alpha _x, \alpha _y]\) as obtained in (1).

  • Velocity: We use a velocity vector \(\vec {v} = [v_x, v_y]\), where \(v_x = x_{t+1} - x_{t}\) and \(v_y = y_{t+1} - y_{t}\), to indicate the direction and magnitude of the i-th spermatozoon’s global translation between two successive frames. Vector \(\vec v\) is shown in Fig. 1(d).
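The flagellum fit of (2) can be sketched as follows. Here `skeleton` (an ordered head-to-tail array of flagellum pixel coordinates) and the squared-error loss are our illustrative assumptions, while fixing \(\textbf{P}_0\) and \(\textbf{P}_4\) and using the Nelder-Mead optimizer follow the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb

def bezier(lam, P):
    """Evaluate the fourth-order Bezier curve B(lambda; P) of (2); P: (5, 2)."""
    lam = np.asarray(lam)[:, None]
    return sum(comb(4, k) * (1 - lam) ** (4 - k) * lam ** k * P[k]
               for k in range(5))

def fit_flagellum(skeleton):
    P0, P4 = skeleton[0], skeleton[-1]         # fixed to head and tail end
    lam = np.linspace(0, 1, len(skeleton))

    def loss(theta):                           # squared distance to the skeleton
        P = np.vstack([P0, theta.reshape(3, 2), P4])
        return np.sum((bezier(lam, P) - skeleton) ** 2)

    x0 = np.linspace(P0, P4, 5)[1:4].ravel()   # initialize P1..P3 on the chord
    res = minimize(loss, x0, method="Nelder-Mead")
    return np.vstack([P0, res.x.reshape(3, 2), P4])
```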

All the representation parameters are summarized in Table 1, indicating whether they are obtained locally (with respect to the window) or globally (with respect to the frame).

Table 1 Summary of parameters of the spermatozoon representation

Finally, we introduce the representation vector at frame t of the i-th spermatozoon

$$\rho _{t} = \left[ \textbf{P}_0,\textbf{P}_1,\textbf{P}_2,\textbf{P}_3,\textbf{P}_4, v_x, v_y, \alpha _x, \alpha _y \right] , $$

so the ordered array of \(\rho _{t}\) from \(t=1\) to \(t=T\) is the fully parametric representation of the trajectory followed by the i-th spermatozoon along a video sequence of T frames

$$\tau = \left[ \rho _{1},\rho _{2},\ldots ,\rho _{T} \right] . $$

Notice that the trajectory vector \(\tau \) captures both the movement and the travel of a schematic spermatozoon within the frame.
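For concreteness, a minimal sketch of how \(\rho _t\) and \(\tau \) can be assembled, with the field ordering above yielding a 14-dimensional vector per frame (five 2-D control points plus velocity and orientation); the function names are ours.

```python
import numpy as np

def rho(P, v, a):
    """P: (5, 2) control points P_0..P_4, v = (v_x, v_y), a = (alpha_x, alpha_y);
    returns the 14-dimensional representation vector rho_t."""
    return np.concatenate([np.asarray(P).ravel(), v, a])

def trajectory(rhos):
    """Stack the T representation vectors into tau, shape (T, 14)."""
    return np.stack(rhos)
```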

3.3 Sperm trajectory synthesis with diffusion models

Denoising Diffusion Probabilistic Models (DDPMs) [46, 55] have gained popularity due to their excellent performance in text-to-image generation. In a nutshell, given a sample (not necessarily an image) \(s_0\), it is successively corrupted by adding a small amount of Gaussian noise during H steps. For any two consecutive samples \(s_{k}\) and \(s_{k+1}\), a neural network can be trained to denoise \(s_{k+1}\) in a supervised way, using \(s_{k}\) as ground truth. During training, the step index is also given as input. At inference, after all the denoising steps, a clean sample is obtained from a fully noisy one. Additionally, in text-to-image tasks, the resulting image is conditioned on the caption.

In the context of this paper, the sample is the trajectory vector \(\tau \) of the i-th spermatozoon along a video sequence of T frames. The goal is to produce a new trajectory \(\hat{\tau }\), conditioned on the representation vector \(\rho _{1}\) at the first frame of the sequence, from a fully noisy vector of the same size after H denoising steps of the learned neural network.

Fig. 2

Diffusion denoising process to generate a trajectory vector \(\hat{\tau }\) in H steps. The first representation vector (with filled background) is always imposed to be the true \(\rho _1\)

The process is depicted in Fig. 2, in which \(\hat{\tau }_H\) consists of the true \(\rho _{1}\) followed by a random array that completes the trajectory vector. After one step h, this array is modified (denoised) towards more meaningful values. After H steps, the array has been transformed into \(T-1\) meaningful representation vectors, resulting in a generated trajectory \(\hat{\tau }\). Then, with \(\hat{\tau }\) and an initial global position (x, y), a schematic spermatozoon is inserted in each frame of the video sequence. Finally, this process is repeated to produce N synthetic, fully labeled, schematic spermatozoa moving around.
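A hedged sketch of this sampling loop is shown below: standard DDPM ancestral sampling with the true first vector re-imposed at every step, as in Fig. 2. The linear noise schedule, the `model(tau, step)` signature and the variance choice are assumptions, not details confirmed by the text; D = 14 matches the representation vector above.

```python
import torch

@torch.no_grad()
def sample_trajectory(model, rho_1, T=16, D=14, H=20):
    betas = torch.linspace(1e-4, 0.02, H)        # assumed linear schedule
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    tau = torch.randn(1, T, D)                   # fully noisy trajectory hat{tau}_H
    for h in reversed(range(H)):
        tau[:, 0] = rho_1                        # impose the true first vector
        eps = model(tau, torch.tensor([h]))      # U-Net's noise prediction
        mean = (tau - betas[h] / torch.sqrt(1 - alphas_bar[h]) * eps) \
               / torch.sqrt(alphas[h])
        noise = torch.randn_like(tau) if h > 0 else torch.zeros_like(tau)
        tau = mean + torch.sqrt(betas[h]) * noise
    tau[:, 0] = rho_1
    return tau.squeeze(0)                        # hat{tau}: (T, D)
```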

3.4 Synthesis of real-looking videos

The resulting video sequence lacks background noise, motion blur, lens and chromatic aberrations, artifacts, shadows, and other particles. Moreover, real spermatozoa do not have a circular head, and the mid-piece is missing in the synthetic one.

To deliver a realistic video, it is necessary to incorporate all these features. To this end, we propose to perform style transfer with a CycleGAN [61]. The CycleGAN learns to transform a schematic frame (domain A) into the style of a real frame (domain B) and vice versa. Completing a cycle from domain A to B and back ensures structural consistency between the generated images. Moreover, both domains are unpaired datasets, so the training process requires no human supervision at all.
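For reference, a minimal sketch of the generator objective, assuming least-squares adversarial terms and an L1 cycle term with weight \(\lambda = 10\); these are common CycleGAN choices, not values confirmed by the text.

```python
import torch
import torch.nn.functional as nnF

def cycle_gan_generator_loss(G_AB, G_BA, D_A, D_B, real_a, real_b, lam=10.0):
    fake_b = G_AB(real_a)                        # schematic -> real-looking
    fake_a = G_BA(real_b)                        # real -> schematic
    pred_b, pred_a = D_B(fake_b), D_A(fake_a)
    # least-squares adversarial terms: fool both discriminators
    adv = nnF.mse_loss(pred_b, torch.ones_like(pred_b)) + \
          nnF.mse_loss(pred_a, torch.ones_like(pred_a))
    # cycle-consistency: A -> B -> A and B -> A -> B reconstruct the input
    cyc = nnF.l1_loss(G_BA(fake_b), real_a) + nnF.l1_loss(G_AB(fake_a), real_b)
    return adv + lam * cyc
```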

Fig. 3

Comparison of the distributions of spline control points based on the KL divergence (Top) and the Earth Mover’s Distance (Bottom). The lower, the better for both. The dotted line is the set derived from the method proposed; \(+\) , \(\star \) and \(\diamond \) represent the LSTM, Gaussian and Train sets respectively

4 Experimental results

In this section, we conduct exhaustive experiments to quantitatively assess the morphology, motility and utility of the proposed method. In addition, we present a qualitative analysis to validate the results from an expert perspective.

4.1 Dataset and experimental approach

We use a private porcine semen dataset provided by a commercial company, consisting of 28 one-second videos recorded at 25 frames per second with a resolution of \(1280 \times 1024\) pixels [62]. The videos were taken with a Motic Panthera C2 microscope with a magnification ratio of \(10\times \), and a Blackfly USB3 camera with a 12.3 MP Sony CMOS sensor.

After filtering the spermatozoa initially detected, we end up with 72 usable sequences. Notice that the goal of this paper is to create synthetic video sequences using a data augmentation method suitable for the problem, instead of simplistic modifications such as sudden rotations or translations. We use a sliding window of 16 frames with a stride of one, so a single sequence of 25 frames is transformed into 9 sequences of 16 frames (from frame 1 to 16, from 2 to 17, and so on until from 9 to 24). The choice of 16 is a trade-off between the remaining sequence length and the factor by which the dataset is increased. Frame 25 is needed to compute the velocity vector of the previous frame, but since there is no following frame, there is no sequence from frame 10 to 25. Hence, we end up with a total of 648 usable sequences with \(T = 16\).
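The window arithmetic can be verified with a few lines (zero-based frame indices; illustrative, not from the original pipeline):

```python
def sliding_windows(seq_len=25, T=16):
    """Start/end indices of the T-frame windows in a seq_len-frame sequence."""
    usable = seq_len - 1   # last frame only supplies the previous frame's velocity
    return [(s, s + T) for s in range(usable - T + 1)]

# 9 windows: (0, 16), (1, 17), ..., (8, 24); and 72 sequences x 9 = 648
assert len(sliding_windows()) == 9
```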

From each sequence, we obtain its trajectory vector as defined in the previous section. Since porcine spermatozoa are no longer than 70 pixels, the window size chosen is \(S=140\).

We categorize each sequence according to its trajectory, following the World Health Organization specifications [7], as “progressive” (46%), “slowly progressive” (22%) and “immotile” (32%) sperm. We keep a stratified 10% of the sequences for testing and use the rest for training.

Our DDPM utilizes a U-Net [14] architecture with 3.96M parameters. It incorporates a sinusoidal positional encoder [63] to indicate the current step in the denoising process. The DDPM is applied to an input noisy trajectory of length 16 and uses 20 denoising steps. To further enhance training stability, we keep an Exponential Moving Average of the model weights [64]. The CycleGAN used for style transfer utilizes two U-Net generators, each comprising 11.4M parameters. The two discriminators are convolutional networks, each with 2.8M parameters. Note that, once trained, only one U-Net generator is required for inference.

4.2 Baseline models

The following experiments consider three sets of synthetic sperm: one derived from our method, and two derived from the baseline models described below. We then compare their metrics with those of the real sperm set.

Gaussian model

As a first baseline, we assume that the trajectory vectors are normally distributed; that is, \( \tau \sim \mathcal {N}\left( \mathbf {\mu }, \mathbf {\Sigma } \right) \), where \(\mathcal {N}\) stands for a multivariate normal with mean \(\mu \) and covariance matrix \(\Sigma \) computed on the training set. Once the distribution is fitted, new trajectory vectors \(\hat{\tau }\) are simply sampled from it.
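A minimal sketch of this baseline, assuming trajectories are flattened into vectors of length \(T \times 14\) (the array name and the seed are ours):

```python
import numpy as np

def gaussian_baseline(train_taus, n_samples=1000, seed=0):
    """train_taus: (n, T*14) flattened trajectory vectors from the train set."""
    mu = train_taus.mean(axis=0)
    Sigma = np.cov(train_taus, rowvar=False)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, Sigma, size=n_samples)  # sampled hat{tau}
```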

LSTM model

As a second baseline, we consider an LSTM model [65] in order to capture the time dependence within the sequence of frames. Specifically, a trajectory generated with this model is \(\hat{\tau } = [\hat{\rho }_{1}=\rho _1,~ \hat{\rho }_{2}=\rho _2,~ \hat{\rho }_{3}=\rho _3,~ \hat{\rho }_{4}=\rho _4,~ \hat{\rho }_{5}, \ldots , \hat{\rho }_{T} ]\), where \(\hat{\rho }_{t} = \textrm{LSTM}(\hat{\rho }_{t-1}, \hat{\rho }_{t-2}, \hat{\rho }_{t-3}, \hat{\rho }_{t-4})\) for \(t=5,\ldots ,T\). In other words, generating a trajectory requires its first four true representation vectors.
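A sketch of this baseline under assumed sizes (d = 14 features, hidden size 128); the architecture details are our illustrative choices:

```python
import torch
import torch.nn as nn

class TrajLSTM(nn.Module):
    def __init__(self, d=14, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(d, hidden, batch_first=True)
        self.head = nn.Linear(hidden, d)

    def forward(self, last4):                    # (B, 4, d): the last four rho's
        out, _ = self.lstm(last4)
        return self.head(out[:, -1])             # predicted next rho_t: (B, d)

def rollout(model, rho_1to4, T=16):
    """Generate hat{tau} from the four true seed vectors, as in the text."""
    seq = list(rho_1to4.unbind(1))               # four (B, d) tensors
    for _ in range(T - 4):
        seq.append(model(torch.stack(seq[-4:], dim=1)))
    return torch.stack(seq, dim=1)               # (B, T, d)
```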

4.3 Evaluation of synthetic sperm morphology

To assess the synthetic sperm morphology, we measure the similarity of the flagellum to that of real spermatozoa. To this end, we compare the respective distributions of each generated spline control point, \(\Pr (\hat{{\textbf {P}}}_{\varvec{0}})\) to \(\Pr (\hat{{\textbf {P}}}_{\varvec{4}})\), using the Kullback-Leibler (KL) divergence and the Earth Mover’s Distance (EMD). To compute the KL divergence, we assume that all the distributions are univariate Gaussians. Since this is a strong assumption, we also compute the EMD, approximating the continuous underlying distributions, \(\Pr (\hat{{\textbf {P}}}_{{\textbf {j}}})\) and \(\Pr ({\textbf {P}}_{{\textbf {j}}})\), by discrete PMFs. Since the shape of the flagellum is related to its motility, we consider the three categories, namely progressive, slowly progressive and immotile sperm. Additionally, to estimate the distributions \(\Pr (\hat{{\textbf {P}}}_{\varvec{0}})\) to \(\Pr (\hat{{\textbf {P}}}_{\varvec{4}})\) for each type of sperm, we generate 1000 full trajectories per type.
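Both metrics can be sketched on 1-D samples, i.e. per coordinate of each control point, using the closed-form Gaussian KL and scipy's EMD implementation; treating each coordinate separately is our simplification for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kl_gauss(p, q):
    """KL(N(mu_p, s_p^2) || N(mu_q, s_q^2)) with moments fitted from samples."""
    mp, sp = p.mean(), p.std()
    mq, sq = q.mean(), q.std()
    return np.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2 * sq**2) - 0.5

def emd(p, q):
    """Earth Mover's Distance between the empirical 1-D distributions."""
    return wasserstein_distance(p, q)
```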

The quantitative results are shown in Fig. 3, in which our method is always depicted as the dotted blue line with circle marks. In both the top and bottom rows, the lower, the better. Hence, any mark above the dotted line is worse than the method proposed here. Specifically, the top row shows that our method attains a KL divergence similar to the one between the test and train set distributions (green diamonds). These results are confirmed in the bottom row, although the LSTM model fits the slowly progressive sperm better. However, the LSTM requires four consecutive representation vectors to generate the video sequence of slowly progressive spermatozoa, while our method requires only one.

4.4 Evaluation of synthetic sperm motility

The evaluation of sperm morphology takes into account the distribution of control points without considering the translation along the sequence. Assessing motility is equivalent to measuring the similarity of the generated trajectories with respect to those in the sequences of the test set. However, a generated sequence and a real sequence of progressive spermatozoa may be significantly different with respect to the path followed and still be similar in terms of distance traveled. In order to compare them, we generate 10 trajectories for each real trajectory of each type and use the Average Displacement Error (ADE) and the Minimum Average Displacement Error (MADE), two popular motion prediction metrics. Given that the generated trajectories are conditioned on the first frame, both generated and real trajectories begin at the same coordinates. Therefore, on average, the generated trajectories are not expected to diverge significantly from the real ones. However, an ADE approaching zero is a warning of overfitting, since there is no generation but mere copying of the true trajectory. Consequently, the only information needed is the coordinates \((x_t^{(i)}, y_t^{(i)})\) for every detected real spermatozoon \(\zeta ^{(i)}\) in every frame t, along with the generated ones \((\hat{x}_t^{(i)}, \hat{y}_t^{(i)})\).
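A minimal sketch of the two metrics on head coordinates, where `gen_set` holds the K = 10 generations per real trajectory (array names are ours):

```python
import numpy as np

def ade(gen_xy, real_xy):
    """gen_xy, real_xy: (T, 2) head coordinates; mean per-frame Euclidean error."""
    return np.linalg.norm(gen_xy - real_xy, axis=1).mean()

def made(gen_set, real_xy):
    """gen_set: (K, T, 2); keep the best of the K generated trajectories."""
    return min(ade(g, real_xy) for g in gen_set)
```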

Fig. 4

(Left) Average Displacement Error, and (Right) Minimum Average Displacement Error; both with respect to the head of the spermatozoon. The lower, the better in both plots. The dotted line is the method proposed; \(+\) , \(\star \) and \(\diamond \) represent the LSTM, Gaussian and Train sets respectively

Fig. 5

Distributions of spline control points and the respective flagella for progressive spermatozoa in the generated synthetic set, test set and train set

The results are depicted in Fig. 4, which follows the same legend as Fig. 3; again, the lower, the better. Our proposed method (dotted blue line) performs as well as or better than the real video sequences and the baseline models on average, considering either ADE or MADE. The only exception is the evaluation of slowly progressive sperm based on MADE. However, we stress that MADE takes into account only a single trajectory (the one with the minimum displacement error) per method, while the average across all trajectories (ADE) is similar for all methods for this sperm type. In addition, our method outperforms the baselines by a much larger margin for the other two sperm types.

4.5 Qualitative evaluation

We present some qualitative results to assess the realism of the generated data in terms of morphology, motility and final rendering.

4.5.1 Morphology

With respect to morphology, the upper row of Fig. 5 shows how the spline control points are distributed within the window, and the lower row shows the rendered splines. For the sake of clarity and compactness, we only depict the progressive sperm generated by our method, together with the training and test sets. Five clusters with a similar distribution are clearly visible in the three upper subplots, the only difference being the number of samples. Likewise, in the lower row, the distribution of rendered splines looks the same and is consistent across the compared sets of progressive sperm.

Next, we show the morphology as the sperm travels through the frame in Fig. 6, in which two synthetic 8-frame sequences (middle and bottom rows) are compared with a real sequence.

Fig. 6

A sample of single-spermatozoon generation in the spline model domain. The upper row shows an 8-frame sequence extracted from the real video; the bottom two rows show two different generated sequences

4.5.2 Motility

To illustrate motility, we present 10 generated trajectories (solid blue) vs. 1 real trajectory (dotted red) for each type of sperm in Fig. 7. The upper row corresponds to the method proposed and the lower row is generated with the LSTM. According to the quantitative results in Fig. 4, both models perform quite similarly in terms of ADE. However, in Fig. 7, it can be seen that the proposed method produces longer and much more diverse paths than the LSTM.

Fig. 7

Ten generated trajectories (solid blue) vs. one real trajectory (dotted red) for each type of sperm

4.5.3 Frames rendering

We compare the final rendering of the method with real frames in Fig. 8(a), in which the left plots are two samples of what a generated frame looks like before being transformed with the CycleGAN, and the right plots are the outcomes of the CycleGAN, hence synthetic frames. On the other hand, Fig. 8(b) shows how the CycleGAN transforms real frames (left) into schematic frames that seem to be extracted from the domain of the generated frames (right). Further qualitative comparisons for the other types of sperm are given in the Appendix.

Fig. 8

Cycle-consistency between schematic domain and true domain. (a) From the generated schematic to the generated real-looking frame. (b) From the true frame to its version in the schematic domain

Fig. 9

Results of the classification of images as real by qualified and other experts: (a) ROC curve; (b) precision, recall and F1-score

4.5.4 Expert evaluation

A survey was conducted to assess the ability of experts to distinguish between real and synthetic images and videos. We recruited 19 participants: 3 from medicine and biology, 7 from the computer vision community, 7 with an interdisciplinary profile including both areas, and 2 from other fields. 16 out of 19 reported experience working with microscopic images, and 7 out of 19 with sperm images. Therefore, we categorized participants into two groups: sperm imaging specialists, referred to as Qualified Experts, and Other Experts. Each respondent was presented with 6 images, 3 real and 3 synthetic, one after the other and in random order, and was asked to indicate to what extent they considered each image to be real or synthetic according to the following mutually exclusive options: synthetic, likely synthetic, unsure, likely real, or real.

To analyse the survey, we use the response to each image as a proxy for the likelihood that the image viewed belongs to a real video and assign the following values:

\(\begin{array}{rccccc} \text {Answer}: & \textit{synthetic} & \textit{likely synthetic} & \textit{unsure} & \textit{likely real} & \textit{real} \\ p(y={\text {`Real'}}|x) = & 0.01 & 0.25 & 0.5 & 0.75 & 0.99 \\ \end{array}\)

Thus, each expert is treated as a classifier returning \(p(y={\text {`Real'}}|x)\), from which a confusion matrix and ROC curve are obtained. The averaged results are shown in Fig. 9. The first remark is that both qualified experts and the rest of the participants attain similar scores. The ROC curve, shown in Fig. 9(a), is close to the diagonal (dotted line), indicating that both groups perform similarly to a random guess. The precision, recall and F1-score metrics are shown in Fig. 9(b). With a confidence threshold of \(\ge 0.75\), a precision of 0.5 shows that half of the images classified as real or likely real by qualified experts were actually synthetic. As the threshold is raised to \(\ge 0.99\) (absolute certainty), qualified experts classified only \(19\%\) of real images correctly, while the other experts identified just \(6\%\) of real images correctly. These results support the claim that the synthetic images cannot be distinguished from real ones.
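A sketch of this analysis with scikit-learn; the answer-to-probability map follows the table above, while the binary threshold handling is our reading of Fig. 9(b), and the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_fscore_support

P_MAP = {"synthetic": 0.01, "likely synthetic": 0.25, "unsure": 0.5,
         "likely real": 0.75, "real": 0.99}

def score_group(answers, is_real, threshold=0.75):
    """answers: list of option strings; is_real: 0/1 ground truth per image."""
    scores = np.array([P_MAP[a] for a in answers])
    fpr, tpr, _ = roc_curve(is_real, scores)              # ROC of Fig. 9(a)
    pred = scores >= threshold                            # metrics of Fig. 9(b)
    prec, rec, f1, _ = precision_recall_fscore_support(
        is_real, pred, average="binary")
    return (fpr, tpr), (prec, rec, f1)
```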

4.6 Application-related validation

Meaningful validation of the generated videos is challenging, as it requires assessing them within the context of the commercial CASA system for which they are intended. To overcome this challenge, we propose training an object detector, namely a YOLO v5 network, to detect sperm on two datasets:

  1.

    a human-labeled dataset consisting of 42 frames that accounts for a total of 6938 spermatozoa.

  2.

    a synthetic dataset consisting of 672 frames from 28 real-looking videos generated with the method proposed, which makes a total of 117,331 spermatozoa.

We train three YOLO v5 models: 1) with 80% of the human-labeled dataset, 2) with 80% of the synthetic dataset, and 3) with the union of the two previous ones. Similarly, we test the above models with the remaining 20% of the human-labeled, synthetic and combined datasets. The results are shown in Table 2.

4.6.1 Training YOLO-based sperm detection

Assuming that only a scarce ‘human-labeled’ dataset exists and it is held out for testing, can we use the synthetic dataset for training? The answer is found in the ‘Human-labeled’ column of Table 2. Training with synthetic data attains a slightly higher F1-score than training with human-labeled data (which we had kept for the sake of this comparison). The union of both training sets achieves a similar F1-score on the human-labeled test set. In other words, the synthetic dataset on its own is capable of training a YOLO-based sperm detector at least as good as a human-labeled dataset, if the latter is not available.

Table 2 F1-score of three YOLO v5 models trained with human-labeled data, our fully annotated synthetic videos, and both

4.6.2 Testing YOLO-based sperm detection

Assuming that the YOLO-based sperm detector is a CASA system whose performance we aim to evaluate, can we use the synthetic dataset for testing? We utilize the synthetically generated videos, which constitute a fully annotated dataset, for evaluation purposes. Hence, we focus on the ‘Synthetic’ and ‘Human+Synth.’ columns of Table 2. If the training set is human-labeled, the F1-score attained across all three test sets is similar. However, the confidence in the results is significantly higher for the ‘Synthetic’ and ‘Human+Synth.’ test sets, as both are 17 times larger than the ‘Human-labeled’ set. Conversely, if we train with a large fully annotated dataset (rows ‘Synthetic’ and ‘Human+Synth.’), the F1-score surpasses 90%.

5 Conclusions

We presented a novel framework for generating realistic synthetic videos of sperm. This framework aims to address limitations that currently hinder the integration of deep learning techniques into CASA systems. Our approach generates labeled videos, including head and flagellum morphology annotations, which can serve as a surrogate for human-labeled data. These synthetic videos have been shown to be useful for training and evaluating CASA systems relying on deep learning detection networks.

A parametric spermatozoon model was proposed to capture the key morphological and motility features of spermatozoa to simulate their trajectories across video frames. Subsequently, a Denoising Diffusion Probabilistic Model was utilized to learn spermatozoa behavior. This enabled the animation of the parametric model, thereby generating realistic motion patterns. These patterns were then embedded into frames to produce schematic video sequences. Finally, the generated videos were rendered with a realistic appearance using a CycleGAN for style transfer.

We conducted experiments to assess the quality of the morphology, motility, appearance, and utility of the generated videos. The experimental evaluation confirmed that the proposed method generates spermatozoa trajectories that align with the real data distribution.

Despite the encouraging results, our focus has remained on a private dataset utilized by a commercial CASA system. Extending the method’s applicability to other datasets may require adaptations, such as a dataset-specific spermatozoon windowing procedure or manual annotation. This necessity arises due to the lack of openly available labeled datasets that meet the requirements of our method. These requirements include access to tracking labels for spermatozoon heads, and sperm concentrations that allow for the extraction of trajectories without intersections between spermatozoa. To the best of our knowledge, four currently available datasets provide sperm head detections [19, 36,37,38]. However, some of these datasets exhibit limitations, such as excessively high sperm cell concentrations [19, 37], low-contrast, monochromatic images [38], or noisy and artifact-prone images [37]. Notably, none of the datasets provide annotations of the flagella. Despite these limitations, the core components of the method, namely the parametric spermatozoon model, the DDPM, and the CycleGAN, are not tied to a particular dataset or CASA system. In fact, different videos only require adjustments in hyperparameters such as the window size, the neighborhood radius, the number of control points, the number of frames, etc.

The findings of this work indicate that synthetic data can serve as a valuable resource to support the evaluation of CASA systems and the training of deep learning algorithms for sperm analysis. Furthermore, these findings could potentially enable additional research in this field, such as conducting more detailed analyses of flagellar beating patterns, encouraging the development of new deep learning methodologies for sperm analysis, or serving as an educational resource for laboratory personnel.