
1 Introduction

A typical radiation therapy treatment starts with a diagnosis based on Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Single-Photon Emission Computed Tomography (SPECT), Positron Emission Tomography - CT (PET/CT), or a combination of medical imaging modalities [48]. A first step in radiation therapy is typically the acquisition of a so-called simulation CT scan, where the patient is positioned in the treatment position. The simulation CT serves as a quantitative patient model which is used in the treatment planning process. Typical tasks are anatomy delineation, optimization of the radiation beam entrance angles and dose calculation. This acquired CT scan also serves as the baseline anatomy of the patient. During the different treatment sessions, the patient will be positioned as closely to this simulated position as possible to minimize positioning uncertainties. For imaging during treatment, the vast majority of radiation therapy systems nowadays rely on x-ray imaging using flat panel detectors. Figure 1 demonstrates this kV image acquisition with a patient in treatment position.

Fig. 1.

Varian TrueBeam radiation therapy delivery platform, acquiring kV images of a patient in treatment position.

Reasons for this are the relatively low complexity, versatile applications, and affordability of these x-ray imaging systems. The x-ray system is used mainly for position verification relative to the treatment planning position, but also for motion monitoring during treatment and, more recently, for plan adaptation. Depending on the needs of the different tasks in radiation therapy, the x-ray based imaging system mounted on the treatment delivery device (on-board imaging) can be used to acquire different types of images such as 2D x-ray projections, fluoroscopic projection sequences, 3D Cone Beam CT (CBCT) images, and motion-resolved 4D CBCT images. A clear future trend is to use the on-board imaging systems for additional tasks throughout the therapy, such as direct organ segmentation and direct dose calculation based on the acquired 3D or 4D CBCT data, and for soft-tissue motion monitoring during treatment. Finally, the patient undergoing fractionated treatment (up to 40 fractions) is sometimes monitored in parallel using CT, MRI, PET/CT, or SPECT/CT to assess the treatment response. This should preferably be (at least partially) replaced by on-board imaging-based procedures available at the treatment device. Technically, the imaging pipeline of the on-board imaging system starts with the x-ray imaging hardware consisting of the x-ray tube and the flat panel detector. The detector acquires 2D projections, for example as single frames (triggered images), to perform an image match against the planning CT prior to each fraction or to verify the internal anatomy at certain control points during radiation beam delivery within a fraction. The projections can also be acquired as a sequence, either from a single viewing direction, for example to verify certain internal motion trajectories, or during rotation of the treatment delivery device (typically called gantry) around the patient to perform motion management.
An elementary part of motion management is tracking internal structures of the patient on the acquired projections. For volumetric imaging the system rotates around the patient and acquires a sequence of projections from which a 3D CBCT image is reconstructed. This includes reconstructions that are resolved with respect to phases of respiratory motion (4D CBCT)  [75] and even respiratory as well as cardiac motion (5D CBCT)  [68]. Having the images at hand, deformable image registration and automatic segmentation  [26] have recently become topics of growing interest in the context of adaptive radiotherapy. In the following, we discuss where learning-based methods are already state of the art and where we see the potential (or already evidence) to apply them to improve x-ray guided radiation therapy and thus clinical outcome.

2 Motion Monitoring During Treatment

Unlike conventional radiation therapy delivery schemes that typically deliver 2 Gy per daily fraction over several weeks, Stereotactic Body Radiation Therapy (SBRT) delivers high doses in a few or even a single fraction (6–24 Gy, 1–8 fractions) [19, 63]. This allows for increased tumor control and reduces toxicity in healthy tissues. To ensure sub-millimeter accuracy for the high dose deposition in SBRT, during-treatment motion management is indispensable. Correct patient setup on the day of the treatment is accomplished by image registration on a 3D CBCT, followed by a couch adaptation. A position verification during delivery of the fraction could be a 3D-3D match at mid- and post-delivery time with a re-acquired CBCT. Alternatively, a 2D-2D match or a 2D-3D match at certain angles can be applied. In the above cases the treatment beam is interrupted and resumed after the position has been confirmed. The drawback is that during actual treatment delivery the therapist is blind to any motion that occurs. An excellent overview of classical state of the art motion monitoring methods with the beam on is given in [6]. Of interest are the tracking methods that use kV/kV imaging with triangulation for 3D position information (CyberKnife®), Sequential Stereo (Varian Medical Systems), which sequentially acquires 2D images with a single imager, and approaches based on a Probability Density Function (PDF). The proclaimed accuracy of the Euclidean distance in 3D space is 0.33 mm on phantom data, and the time delay needed to acquire sufficient images for the 3D reconstruction has to be considered. Alternatively, updating the raw image stack with the latest 2D acquisition smears out the position change that occurred on the last 2D image, since the 3D volume also contains all previous images that did not undergo this motion. Methods based on a PDF map derived from 4DCT data acquired before the treatment day [36] proclaim an accuracy of the Euclidean distance in 3D space of 1.64 ± 0.73  mm.
Deep learning methods are being developed to further improve these results and are being discussed below for both bony structures and soft tissue.

2.1 Tracking of Bony Structures

For spine tumors, neighboring vertebrae can serve as anatomical landmarks and periodically acquired 2D kilovoltage (kV) images during treatment allow for a fast detection model to compare the vertebrae positions to the reference positions for a specific gantry angle. A CTV to PTV margin, representing all expected uncertainties including motion during delivery, is recommended to be \({<}\)3  mm in  [49]. Recommendations on patient setup accuracy (positioning of the patient on the couch before delivery of the treatment) are \({<}\)2  mm for translations, \({<}\)3\(^{\circ }\)–4\(^{\circ }\) for roll and pitch and \({<} 10^{\circ }\) for yaw, according to  [5]. A dosimetric study for spine stereotactic treatments recommends a patient setup translational error \({\le }\)1  mm and a rotational error \(\le \)2\(^{\circ }\)  [79] while the rotational setup error recommendation is reduced to \({\le }\)1\(^{\circ }\) in  [15]. Note that in all above situations a 3D setup CBCT or other volumetric verification is available. However, as the patient is not supposed to move anymore after correct setup, the above recommendations can also be projected to in-treatment position monitoring. An intrafraction study of spine SBRT treatments that acquired CBCTs during treatment delivery reports position standard deviations of up to 1.3  mm, and this for each of the three main axes: chest-back, left-right and head-feet  [49].

In  [66] a Deep Learning (DL) model based on Mask R-CNN (Regional Convolutional Neural Network) is described for vertebra detection of the thoracic spine (T9–T12) and the lumbar spine (L1–L4). It differs from the above methods in two key aspects: First, the model does not rely on temporal imaging information acquired prior to the delivery time instance where the position is verified. Second, the model generalizes to vertebrae in the human body, which means that no patient-specific information is needed and no models need to be trained prior to treatment delivery. The model allows for a fast structure localization (\({<}\)2 Hz) on 2D kV projection images that are acquired during the VMAT treatment delivery. It allows for instant 2D position verification, using segmentation, along the delivery of the VMAT arc, as well as sequential (delayed) 3D position verification when the subsequent projection images are included in a Digital TomoSynthesis (DTS) or CBCT. Alternatively, making use of a stereoscopic dual imager setup, the 2D position pairs can be triangulated to obtain an instant 3D position. Intensity-Modulated Radiation Therapy (IMRT) and 3D Conformal Radiotherapy (CRT) can benefit from the DL model for fast structure localization as well, as long as 2D kV projection images are acquired. Typical model training times vary from 1–3 days on an Intel Xeon W-2102 2.90 GHz CPU with 32 GB RAM and an NVIDIA GeForce GTX 1080 Ti with 12 GB RAM. The model’s accuracy in detecting and estimating motion is assessed offline using the well-known Mean Average Precision (mAP) metric  [57]. Although the mAP metric makes sense in the computer vision domain, from a clinical perspective there are other, more important metrics to consider: In this study the motion of the 2D Centre of Mass (CoM) of the vertebrae is assessed for the best model as identified by the mAP. The test data in the first assessment contains actual patient data.
In addition, a patient-like full-body phantom with vertebrae (PIXY TPO-1067  [38]) in treatment position is moved in a controlled setup and the motion detection is assessed by the DL landmark detection model for vertebra and compared to its ground truth.
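The stereoscopic triangulation step mentioned above can be sketched as a least-squares intersection of two imaging rays. The following is an illustrative reconstruction of the geometry, not the vendor implementation; source and detector point positions (in a common 3D frame) are assumed to be known from the calibrated imager geometry.

```python
import numpy as np

def triangulate(src_a, det_a, src_b, det_b):
    """Least-squares intersection of two imaging rays.

    Each ray runs from an x-ray source position through the detected 2D
    structure position lifted into 3D detector-plane coordinates. The
    returned point minimizes the summed squared distance to both rays.
    """
    sources, directions = [], []
    for src, det in ((src_a, det_a), (src_b, det_b)):
        src = np.asarray(src, float)
        d = np.asarray(det, float) - src
        sources.append(src)
        directions.append(d / np.linalg.norm(d))
    # Normal equations: sum_i P_i x = sum_i P_i s_i, with P_i the projector
    # orthogonal to ray direction u_i.
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for src, u in zip(sources, directions):
        proj = np.eye(3) - np.outer(u, u)
        A += proj
        b += proj @ src
    return np.linalg.solve(A, b)
```

For non-parallel rays the 3x3 system is well-posed; with noisy detections the result is the point closest to both rays rather than an exact intersection.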

Fig. 2.

Left: 2D projection image of a patient not previously seen by the model. Right: Vertebrae detected by the model and their derived CoMs (blue dots). (Color figure online)

An ordinary 2D kV projection image (Fig. 2, left) needs to be provided to the model, which returns a segmentation mask, a bounding box and a classification label (not shown) for each vertebra that is detected (Fig. 2, right). Additionally, the 2D CoM is calculated from the segmentation. Figure 3 summarizes the model performance on CoM motion detection (for isocentric shifts/rotations), based on 50 structures, detected on different projection angles uniformly distributed over the acquired arc. The motion was introduced in the head-feet (vertical) direction and the horizontal direction. Depending on the gantry angle this would correspond to a combination of a chest-back and a lateral motion. To detect a rotational change based on the CoM (a single point), at least 2 vertebrae CoMs are required. This study considers all vertebrae in the field of view. Figure 4 shows the probability of a tracking error in the range of 0–2 mm. The different curves show the probability at different shift amplitudes that were carried out as well as the probability when all shift amplitudes are evaluated together.
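The derivation of the CoM from the returned segmentation mask, and the resulting 2D motion estimate, reduce to a few lines. The helper names below are our own, and the pixel spacing is an assumed calibration parameter of the imager.

```python
import numpy as np

def center_of_mass(mask):
    """2D centre of mass (row, col) of a binary segmentation mask."""
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

def com_shift(mask_ref, mask_cur, pixel_spacing_mm=1.0):
    """2D motion estimate: CoM offset between reference and current mask,
    scaled by the detector pixel spacing to obtain millimetres."""
    ref = np.array(center_of_mass(mask_ref))
    cur = np.array(center_of_mass(mask_cur))
    return (cur - ref) * pixel_spacing_mm
```

With a dual imager setup, two such 2D offsets (one per view) feed the triangulation; with a single imager, the offset is a projection of the true 3D motion onto the detector plane.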

Fig. 3.

Detection accuracy of CoM motion for four shifts in chest-back or lateral (equivalent to the horizontal direction on the image) and head-feet (vertical direction on the detector) direction (left graph). A combined shift-rotation was introduced as well, where its equivalent shift of the CoM in 2D was assessed. The right graph shows three rotation offsets for angular directions \(\alpha \). Detections were performed for 5 vertebrae fully visible on all 10 projection images of a patient previously not seen by the model (blind data set). \(\varDelta \textit{\textbf{s}}\) is the 2D vector offset between both positions. \(\varDelta \alpha \) is the 2D angular offset between both positions. The orange line represents the median value, and the box comprises all values between quantiles 1 and 3. The whiskers are set to contain all values within 1.5 x IQR (Inter-Quartile Range: Q3–Q1).

Fig. 4.

Probability of a tracking error \(\varDelta \textit{\textbf{s}}\) in the range of 0–2 mm. The different curves show the probability at the different shift amplitudes of Fig. 3, as well as the probability when all shift amplitudes are evaluated together.

The second assessment involves the PIXY patient-like full-body phantom with vertebrae. The results for a position change detection based on a single vertebra are shown in Figure 5. The data for one shift \(\textit{\textbf{s}}\) contains a total of 40 structure shifts, detected on projection images that are orthogonal to the chest-back or lateral axes (Gantry angles: 0\(^{\circ }\), 90\(^{\circ }\), 180\(^{\circ }\) and 270\(^{\circ }\)). In total three such shifts \(\textit{\textbf{s}}\) were analyzed: 25.46  mm, 11.31  mm and 4.24  mm. Figure 6 shows the probability of a tracking error in the range of 0–2 mm. The different curves show the probability at the different shift amplitudes that were carried out as well as the probability when all shift amplitudes are evaluated together.

Fig. 5.

Detection accuracy of CoM motion of a phantom on the treatment couch for shifts in the chest-back (x) and head-feet (y) direction, where \(\textit{\textbf{s}}\) is the resulting shift vector. A total of 40 vertebrae positions were analyzed for each of the three shifts. \(\varDelta \textit{\textbf{s}}\) is the 2D vector offset between both positions. The orange line represents the median value, and the box comprises all values between quantiles 1 and 3. The whiskers are set to contain all values within 1.5 x IQR (Inter-Quartile Range: Q3–Q1).

Fig. 6.

Probability of a tracking error \(\varDelta \textit{\textbf{s}}\) in the range of 0–2 mm. The different curves show the probability at the different shift amplitudes of Fig. 5, as well as the probability when all shift amplitudes are evaluated together.

The above results show sensitivity to positional changes in the range of 1.5  mm, with a median error below 0.5  mm. Combining the positional information of all vertebrae visible on a single projection image yields sub-millimeter motion detection down to the smallest shift of 1 pixel-equivalent on the detector. The experiments with the phantom confirm these results. Spine rotations above 1\(^{\circ }\) can be identified; at 0.5\(^{\circ }\) the detection becomes unstable.

2.2 Soft Tissue Tracking

Soft tissue position tracking is crucial in motion monitoring during radiation therapy to ensure that high dose delivery is confined to the tumor (Fig. 7) and not to the surrounding healthy tissue. Significant efforts have been made to improve the accuracy and robustness of soft tissue tracking in the past [39], where kV x-ray imaging is probably the most commonly used imaging modality for a number of practical reasons. One big challenge is the lack of sufficient contrast between soft tissue and background, which makes it different from regular visual object tracking in computer vision [47]. To tackle this issue, different approaches have been proposed in the literature by either utilizing treatment planning information or exploiting medical physics knowledge in imaging. Machine learning, especially deep learning, algorithms have become increasingly popular in this domain.

Fig. 7.

Example image for pancreas tracking  [41] on simulated CBCT (left) and short-arc CBCT (\(20^{\circ }\), right) with overlaid pancreas contour (yellow) and tracked contour (blue) (Color figure online)

Treatment planning imaging contains rich information about target characteristics and motion that can be represented by mathematical models or encoded in deep neural networks (DNNs). In [35], 3D diaphragm motion models were generated from segmented 4D CT images and then forward projected to the 2D X-ray panel geometry for diaphragm tracking. In [86], the simulation CT was deformed and transformed to generate enough synthetic digitally reconstructed radiographs (DRRs) along with known tumor locations. Then, a DNN was used to model the relation between DRRs and their corresponding bounding boxes of tumors. This model was applied to predict tumor locations in real projections acquired during the actual treatment.

Physics-based approaches aim at exploiting hardware advances to improve soft tissue contrast. Dual energy (DE) imaging and multi-layer detectors are among these promising technologies. For example, a fast-kV switching DE fluoroscopy was implemented on a bench top system by alternating between high and low x-ray energies. Bony anatomy was suppressed using the classical weighted logarithm subtraction (WLS) method [34]. A deep learning model was used to improve the accuracy of WLS [33]. In x-ray imaging, a stacked flat panel detector design allows acquiring multiple images with low and high signal-to-noise ratio (SNR) and high and low spatial resolution, respectively. Image fusion schemes are available to take advantage of such “low - high” and “signal - resolution” information to combine images with the aim of maximizing the SNR of the fused image while preventing loss of spatial resolution [88].
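The classical WLS method referenced above can be illustrated with a monoenergetic toy model. The attenuation coefficients below are made-up illustration values, not measured spectra; choosing the weight as the ratio of the bone attenuation coefficients cancels the bone signal exactly in this idealized setting.

```python
import numpy as np

# Hypothetical monoenergetic attenuation coefficients (1/cm), illustration only.
MU_BONE_LOW, MU_BONE_HIGH = 0.9, 0.4
MU_SOFT_LOW, MU_SOFT_HIGH = 0.25, 0.18

def wls_subtract(i_low, i_high, w):
    """Weighted logarithm subtraction of a low/high-kV image pair."""
    return np.log(i_high) - w * np.log(i_low)

# Simulate path-length maps (cm): a bone step edge on uniform soft tissue.
t_soft = np.full((4, 4), 10.0)
t_bone = np.zeros((4, 4)); t_bone[:, 2:] = 2.0
i_low  = np.exp(-(MU_BONE_LOW  * t_bone + MU_SOFT_LOW  * t_soft))
i_high = np.exp(-(MU_BONE_HIGH * t_bone + MU_SOFT_HIGH * t_soft))

# The weight mu_bone_high / mu_bone_low cancels the bone term in this model.
w = MU_BONE_HIGH / MU_BONE_LOW
soft_only = wls_subtract(i_low, i_high, w)
```

In practice, beam hardening, scatter and noise break the exact cancellation; this is where the cited deep learning refinement of WLS [33] comes in.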

3 CBCT Image Reconstruction

Nowadays, volumetric imaging is arguably an integral part of the workflow in radiation therapy. While initially it was mainly intended for bone-based 3D positioning of the patient, it has progressively become an important tool for soft tissue matching thanks to improvements in image quality of the reconstructed volume. Recently, these improvements have enabled adaptive radiation therapy, where the treatment plan is adapted to anatomical changes directly during a fraction, prior to the actual treatment. Clearly, improving image quality is the key aspect of the successful deployment of volume reconstruction methods, and the current progress in machine learning brings new opportunities in this area.

A typical image reconstruction pipeline consists of pre-processing performed in the projection space, analytical or iterative reconstruction, and volume post-processing.

3.1 X-ray Projection Pre-processing

The pre-processing phase already provides several opportunities for a successful application of deep learning (DL) methods. A good example is the correction of signal degradation caused by x-ray photons that are scattered within the patient body  [22, 54]. The approach is to use Monte Carlo methods as forward simulation of the primary and scatter signal and to train a U-net type regression model to predict the scatter component from the combined signal acquired by the flat panel detector.
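As a rough stand-in for such a learned scatter predictor, the low-frequency nature of scatter can be illustrated with a simple blur-and-subtract baseline. A U-net trained on Monte Carlo (total, scatter) pairs would replace the crude estimate below while keeping the same interface; the scatter fraction and kernel size are arbitrary illustration values.

```python
import numpy as np

def box_blur(img, k):
    """Simple separable box blur (scatter is dominated by low frequencies)."""
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)

def correct_scatter(total, scatter_fraction=0.3, k=15):
    """Subtract a crude low-frequency scatter estimate from the measured signal.

    A trained U-net predicting the scatter component from `total` would slot
    in exactly where `scatter_fraction * box_blur(...)` stands here.
    """
    scatter_est = scatter_fraction * box_blur(total, k)
    return np.clip(total - scatter_est, 0.0, None)
```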

Metal artifacts in CBCT are another prominent example where projection-based corrections with the help of DL show advances compared to classical approaches  [50, 52, 53, 60, 83]. The special challenge here is that x-ray beams penetrating metal objects are affected by various physical effects, namely beam hardening, increased scatter, and high noise because of the strong attenuation. This makes it nearly impossible to use the affected information directly and thus requires including prior knowledge in the reconstruction process.

3.2 CBCT Volume Post-processing

An obvious application of DL methods is post-processing of CBCT volumes performed to correct for inaccuracies and artifacts. A beneficial aspect here is that the generation of training data for supervised learning is often quite straightforward: A prominent scenario is to generate training pairs with the complete projection set as ground truth and a projection subset (sparse-view) as a simulation of low-dose scans  [31, 43, 85]. All these methods apply classical filtered back-projection (FBP) to perform a first reconstruction affected by typical sparsity streaks and use neural networks, such as U-Nets, to improve the quality of the final image. The reconstruction and subsequent correction of limited-angle acquisitions have been addressed using a similar approach  [70, 80]. However, it has been pointed out that these approaches cannot guarantee that the output image faithfully represents the anatomy of the patient and does not fabricate fictitious structures due to the prior knowledge trained on a patient population. A possible mitigation is to compare the reconstructed volume against the acquired projections by applying forward projections and enforcing minimal differences  [46]. A hybrid approach has been proposed in  [37] to reduce artifacts related to the incompleteness of input projections due to limited-angle, sparse and truncated acquisitions: In the first phase, a U-net is employed to complete the insufficient input data. The completed set, combining both the measured and computed projections, is then reconstructed by a conventional iterative reconstruction technique. A rather rarely addressed topic is motion artifact reduction, presumably because of the general lack of motion-free ground truth training data. In  [61] we proposed a framework for CBCT motion artifact simulation and applied it in a proof-of-principle study [62] to train a U-net based artifact reduction method in the image domain (see Fig. 8).

Fig. 8.

Examples of training data used for DL-based motion artifact reduction  [62] generated by applying 4DCT based motion simulation using recorded breathing curves  [61]. The columns show from left to right the simulation with motion artifacts X, the desired output representing the average motion during scan Y, the artifact image \(X-Y\), the image after applying the predicted artifact \(Y-P\), and the predicted artifact P.

3.3 Iterative CBCT Reconstruction Methods

Apart from FBP, iterative methods represent a common technique for CT image reconstruction. Here the reconstruction corresponds to a step-by-step minimization of the objective loss function \(\ell (f,p,A)\) defined in terms of the reconstructed volume f, acquired projections p and the system matrix A relating the volume voxels to pixels in the projection space. The loss function \(\ell (f,p,A) = \psi (f,p,A)\,+\,r(f)\) consists of the data fidelity term \(\psi (f,p,A)\) enforcing consistency of the reconstructed volume with acquired projections and the regularization term r(f) encouraging reconstructions to satisfy a priori assumed properties, e.g. piece-wise smoothness. The common choices for the former include the \(L_2\)-norm projection error \(||A f - p||_2^2\) [28] or the statistical loss function [20, 65] taking into account the stochastic nature of signal detected in the projections. The regularization is often represented by variants of the total-variation [23]. During each iteration step, the update of f is calculated by comparing the forward-projected volume Af to the acquired projections; the mathematical formulation depends on the precise form of the objective loss function, the chosen regularizer and the iteration scheme [2, 28]. Machine learning techniques can then alter this general scheme in a number of ways.
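The iterative update described above can be sketched for the plain \(L_2\) data fidelity term, with the regularizer r(f) omitted for brevity and a small random matrix standing in for the real cone-beam system matrix A:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system matrix A (projection pixels x volume voxels) and ground-truth volume.
A = rng.standard_normal((40, 16))
f_true = rng.random(16)
p = A @ f_true                              # noiseless "acquired" projections

# Gradient descent on the data fidelity ||A f - p||_2^2, no regularizer.
f = np.zeros(16)
step = 1.0 / np.linalg.norm(A.T @ A, 2)     # stable step size from the spectral norm
for _ in range(2000):
    residual = A @ f - p                    # compare forward projection to acquisitions
    f -= step * (A.T @ residual)            # back-project the residual to update f
```

Adding r(f), e.g. a total-variation term, contributes an extra gradient term per iteration; the learned variants discussed next replace parts of exactly this update loop.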

The first set of methods includes learning prior information from a training dataset containing high-quality reconstructions or, alternatively, general images. The learned information is then used at each iteration step to enhance the quality of limited-angle or sparse-view reconstructions. The learned information can be as simple as texture content in similar patches [42], while deploying deep neural networks allows for extracting higher-level features and for greater expressivity. In [10], a deep residual convolutional neural network (CNN) was trained for image denoising on the COCO dataset [51] and then used as a filter at each iteration step. In [13], a CNN is trained on a dataset containing high-quality reconstructions to yield ground truth images by refining unfinished iterations. The trained CNN then defines a regularization term enforcing the volume to lie close to the ground truth.

In another set of methods, each iteration step is partially replaced by a deep neural network and the whole unrolled system is trained at once, having the subsampled set of projections as an input and high-quality reconstructions as target. Examples include [77] or [11, 17]; in the latter, a DenseNet-inspired network is used in each iteration step to propose an optimal volume update based on the current as well as the previous gradients of the loss function; this is in fact a generalization of the Nesterov momentum [78] used for the speedup of iterative reconstruction.

3.4 End-to-End CBCT Image Reconstruction Learning

The last class of reconstruction algorithms that we want to mention here applies deep learning in an end-to-end approach where pre- and post-processing (in the projection and volume domains) are jointly trained. One of the foundational papers here, by Würfl et al. [81], uses neural networks to learn filtering and weighting in projection space while evaluating loss functions in the image domain. Other prominent examples are the previously mentioned methods for metal artifact reduction by Lin and Lyu et al. [52, 53], but also for limited-angle scans [30]. Zhu et al. [87] proposed to learn the complete reconstruction process including the domain transformation between the projection and volume space. That implies learning the system matrix, which is normally well known.

An alternative approach to DL-based end-to-end reconstruction is presented in  [24]: a continuum of intermediate representations is employed to break down the original problem, where line integrals are gradually restricted via partial line integrals until the level of image voxels is attained. The resulting hierarchy is mapped onto the network architecture, allowing for significant reduction of the computational complexity.

3.5 4D CBCT Reconstruction

In 4D reconstruction, typically 10 volumes resolved by respiratory phase are reconstructed from acquisitions with approximately the same number of projections as needed for a 3D scan, and thus approximately the same dose. This leads to strong under-sampling of the individual phases and makes it even harder to obtain adequate image quality. Classical approaches try to join information from all phases by using an initial combined reconstruction (MKB)  [76], temporal regularization (4DTV)  [64], or by applying deformable registration between the phases (MoCo)  [7].
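The phase sorting underlying 4D CBCT can be sketched as follows, assuming an idealized, perfectly periodic breathing signal; in practice the respiratory phase is derived from an external surrogate or extracted from the projections themselves.

```python
import numpy as np

def phase_bin(times, breathing_period_s=4.0, n_phases=10):
    """Assign each projection timestamp to one of n_phases respiratory bins.

    Assumes a perfectly periodic breathing signal with known period; each
    bin later receives its own (heavily under-sampled) reconstruction.
    """
    phase = (times % breathing_period_s) / breathing_period_s  # in [0, 1)
    return (phase * n_phases).astype(int)

# 600 projections over a 60 s gantry rotation -> roughly 60 projections per
# phase bin, i.e. a tenth of the angular sampling of the 3D scan.
times = np.linspace(0.0, 60.0, 600, endpoint=False)
bins = phase_bin(times)
```

The roughly tenfold reduction in projections per bin is exactly the under-sampling that the classical (MKB, 4DTV, MoCo) and learned approaches try to compensate.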

In [12] an iterative deep learning approach derived from the 3D AirNet method [11] has been applied to reduce sparseness streaks in 4D CBCT reconstructions. Zhang et al.  [84] propose a motion-compensated reconstruction algorithm applying deep learning for patient-population-based deformation field refinements. A method for the suppression of sparseness artifacts in cardiac CT imaging, based on learned data exchange between phases with a cyclic loss, has been presented by Kang et al.  [44].

In motion-resolved reconstructions the challenge is to overcome the sparse sampling of the individual motion states by reusing information from other motion states. This implies that the ongoing motion needs to be resolved to a certain extent, which makes the problem even more ill-posed; therefore prior information about anatomy and physiological motion needs to be taken into account. Further challenges are to move beyond purely phase-correlated reconstruction and to address motion amplitudes [69].

4 Deep Learning for Organ Segmentation

For the generation of a radiotherapy treatment plan, the position of the tumor as well as surrounding organs need to be known. In a typical workflow, a clinician contours these structures on either a CT or an MRI image.

Previously, the automatic segmentation of anatomical structures was performed by heuristic algorithms, such as thresholding  [3] or watershed  [74], combined to cover a complete anatomical site [29]. However, these algorithms need to be specifically designed for each organ. The advancements in deep learning now make it feasible to generate segmentation solutions for a multitude of different anatomical structures using the same or similar underlying neural networks. These are then trained on example segmentations to adapt to the particular structure. The underlying neural network is most often a convolutional neural network, as its architecture is especially suited for image-related tasks. More specifically, the U-Net  [67] and derivations of it, such as the Tiramisu  [40] and BibNet  [71], are commonly employed for anatomical segmentation.

One limitation of the above-mentioned networks is their inability to learn strong shape priors. This leads to inadequate performance in the case of weak image quality. Methods such as anatomically constrained neural networks  [59] try to circumvent this by forcing the network to learn a shape representation. With the methods described above, organ segmentation algorithms have been developed for many different anatomical sites such as abdomen  [9], female breast  [56, 71], head and neck  [58], female pelvis  [32], male pelvis  [73] and thorax  [18]. For head and neck and male pelvis, performance on par with clinicians has been reported.
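Agreement with clinician contours is commonly quantified with overlap metrics such as the Dice similarity coefficient (the cited studies may report additional metrics such as surface distances); a minimal implementation:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary segmentation masks.

    1.0 means perfect overlap, 0.0 means no overlap at all.
    """
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: defined here as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```

Being a pure overlap measure, Dice is insensitive to where along the boundary a disagreement occurs, which is why surface-distance metrics are often reported alongside it.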

A special challenge is performing automatic segmentation directly on CBCT reconstructions [1] due to their, in some aspects, inferior image quality compared to CT (see Fig. 9). Residual motion is, apart from scatter, one of the most prominent challenges, which especially makes it hard to define the ground truth.

Fig. 9.

Exemplary pancreas automatic segmentation result (red) on a CBCT volume  [1]. (Color figure online)

The integration of these algorithms in clinical practice faces one additional hurdle: The data used for training the deep neural network may originate from a different hospital or even a different geography. For CT segmentation, there is evidence that for most structures the quality remains unimpaired by training on data from a different hospital, as long as the segmentation guidelines are identical [72]. For MRI segmentation, the quality improves if data from the deploying hospital is included in the training or as part of an on-boarding process [27]. An alternative way to overcome this challenge is the implementation of a distributed learning method that is able to leverage the data from multiple hospitals in a privacy-preserving manner [14].

5 Deformable Image Registration

It is of utmost importance in radiotherapy that the prescribed dose is delivered to the target as conformally as possible while sparing neighboring organs at risk.

A patient's anatomy changes from day to day. This can lead to misadministering the dose and thus failing to fulfill the clinical goals. To avoid this, adaptive radiotherapy (ART) was introduced [45]. Here, a patient image of the day is used to update the deprecated treatment plan. Furthermore, one needs to track the absorbed dose in each organ over the whole treatment, consisting of multiple fractions, to ensure correct dose coverage of the tumor without overdosing the organs at risk. This process is called dose accumulation.

In both steps, deformable image registration (DIR) is used. DIR morphs the original image to the updated image set of the day. The “path” of each voxel is saved as a 3D vector in a deformation vector field (see Fig. 10), which can later be used to deform the dose as well for accumulation.
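The role of the deformation vector field can be sketched as a pull-back warp, shown here in 2D for brevity; nearest-neighbour sampling keeps the example short, whereas clinical DIR implementations use higher-order interpolation.

```python
import numpy as np

def warp(image, dvf):
    """Pull-back warp: output pixel (r, c) samples image at (r, c) + dvf[r, c].

    dvf has shape (H, W, 2) holding per-pixel displacement vectors in pixel
    units. Nearest-neighbour sampling with edge clamping for simplicity.
    """
    h, w = image.shape
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_r = np.clip(np.rint(rr + dvf[..., 0]).astype(int), 0, h - 1)
    src_c = np.clip(np.rint(cc + dvf[..., 1]).astype(int), 0, w - 1)
    return image[src_r, src_c]

# For dose accumulation the same field deforms each fraction dose, e.g.:
#   accumulated += warp(fraction_dose, dvf)
```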

Fig. 10.

A deformation vector field (arrows) to morph one CT (grey in the foreground) to the other CT (black in the background) of the same patient in different breathing phases.

However, due to the vast number of voxels that need to be compared and moved around by optimization algorithms, a conventional DIR [8] can take up to minutes to finish. That is the first opportunity for DL to help: as soon as the patient is positioned and the images of the day are recorded, everything should happen fast, so that dose delivery can start before anatomy changes [21] or patient position changes occur. DL-based DIR can be done within a fraction of a second [4, 16] for 3D image volumes, which is a considerable improvement over, e.g., 1 min.

Another point where DL might help to improve the results is later, in the process of dose accumulation. Classical DIR algorithms follow a fixed set of parameters and therefore perform better or worse depending on the image that needs to be deformed. Furthermore, there are infinitely many ways to deform one image into another and thus no ground truth actually exists. That is why unsupervised learning is normally used for this kind of problem, since supervised models would only mimic the behavior of the classical algorithm, i.e. copy its problems and uncertainties.

Unsupervised learning detects patterns on its own, and thus might be able to outperform existing solutions [4, 55].

Several architectures are being used and tested [25]. First approaches relied on typical U-Nets [4], and newer publications are looking into the potential of GANs  [82] to generate either the morphed image directly or “just” the deformation vector field.

6 Conclusion

As shown by several examples from the image guided radiation therapy field, we see enormous potential of data-driven methods to enhance or replace state-of-the-art algorithms. This can be observed in various stages of the imaging pipeline. Notably, the biggest improvements can be observed where learning-based methods are used with consideration of domain knowledge (e.g. x-ray imaging physics), rather than as pure black-box applications. We take this as motivation to further explore problem-specific network architectures and loss functions to obtain solutions that leverage physical or physiological constraints to reduce the solution space for the training process. Beyond this, we see many synergies between the described domain solutions, where integrated solutions, e.g. deformable registration with implicit segmentation or image reconstruction with implicit deformable registration against prior acquisitions, could be future developments. In conclusion, we sense wide agreement in the scientific community that deep learning will be the next evolutionary step in the field.