1 Introduction

Depth sensing technologies based on structured light or time-of-flight have become popular in recent years. Their applications have also been widely studied in the healthcare domain, such as patient monitoring [1], patient positioning [16] and computer-aided interventions [19]. In general, depth imaging provides real-time and non-intrusive 3D perception of patients that can be used for markerless registration, replace conventional RGB cameras, and potentially achieve higher robustness against illumination changes and other data variability.

To enable such clinical applications, one of the fundamental steps is to align the pre-operative image, such as CT or MRI, with the real-time patient image from the depth sensor. This requires an efficient and accurate registration or ego-positioning algorithm. Since depth sensors capture the 3D geometric surface of the patient and the skin surface can be readily extracted from CT scans, surface-based registration methods [2, 14, 19] have been intuitively proposed. However, these methods usually fail to perform robustly due to several challenges: (1) the surface data obtained from the depth sensor is noisy and suffers from occlusions; (2) the surface similarity is corrupted by the patient's clothing or protective covers; (3) the two modalities may have different fields of view; CT data, for example, often covers only a part of the patient's body; (4) the patient's pose/shape may vary between the two imaging processes. To overcome these challenges, most existing solutions still rely on marker-based approaches [5].

Another way to formulate the depth-CT registration problem is to utilize the internal body information that the CT scan naturally captures. Unfortunately, the physical principles underlying depth sensing and CT imaging are so different that the information from the two modalities has little in common. To measure the similarity between different modalities, learning-based algorithms have been actively explored [4, 15]. Most recently, there has been significant progress in feature representation learning using deep convolutional neural networks, which extract hierarchical features directly from raw visual input. The high-level features encode rich contextual information and are robust against noise and other data variations. Moreover, by combining deep convolutional neural networks with reinforcement learning, deep reinforcement learning (DRL) has demonstrated superhuman performance in various applications [10, 13].

In this paper, we propose a deep reinforcement learning based multimodal registration method that handles the aforementioned challenges. An overview of the algorithm workflow is shown in Fig. 1. Our major contributions are summarized as follows: (1) We propose a learning-based system derived from deep Q-learning [13] that automatically extracts compact feature representations to reduce the appearance discrepancy between depth and CT data. To the best of our knowledge, this is the first time a state-of-the-art DRL method has been used to solve the multimodal registration problem in an end-to-end fashion. (2) We also propose to use contextual information for the depth-CT registration. Compared to conventional methods that compute surface similarities, our algorithm learns to exploit the relevant contextual information for optimal registration.

Fig. 1. Run-time workflow of the proposed DRL registration framework. The iterative observe-action process gradually aligns the multimodal data until termination.

2 Related Work

Registration of multimodal data has recently attracted increasing attention in medical use cases. Different information is extracted and fused from scans of different modalities to provide pieces of an overall picture of pathologies. In general, most multimodal registration (MMR) approaches fall into one of two categories. Algorithms in the first category attempt to locate invariant image features [2, 17], while approaches in the second apply statistical analysis, such as regression, to find a metric that measures the dependency between the two modalities [4, 7]. Different from those approaches, our method learns both the feature representations and the alignment metric implicitly, in an end-to-end fashion, with DRL.

DRL is a powerful algorithm that trains an agent interacting with an environment, taking image observations and rewards as input and outputting a sequence of actions. This working mechanism makes it well suited to sequential decision-making problems, for example landmark detection in medical images with trajectory learning [6]. To the best of our knowledge, the most relevant registration work is proposed in [11], which solves the 3D CT volume registration problem with a standard deep Q-learning framework. To speed up the training process with the 6 degree-of-freedom transformation, they replace the agent's greedy exploration process with a supervised learning scheme. In our scenario, due to the appearance discrepancies as well as ambiguities caused by missing observations, we instead encourage the agent to explore the search space freely rather than exploiting the shortest path. Furthermore, we utilize the history of actions to help the agent escape from local loops caused by incorrect initialization, which differentiates our work from theirs.

3 Method

We propose a novel MMR algorithm that aligns the depth data to the medical scan. Our work is inspired by how human experts perform manual image alignment, which can be described as an iterative observe-action process. Similarly, the DRL algorithm trains an agent with observations from the environment to learn a control policy, reflected by its capability of making sequential alignment actions given those observations. The rest of this section describes the proposed registration method in detail.

3.1 Environment Setup

In deep reinforcement learning, the environment E is organized as a stochastic finite state machine. It takes the agent's action as input and outputs states and rewards. The agent is designed to have zero knowledge about the internal model of the environment beyond the observed states and rewards.

States: In our setup, the state is represented by a 3D tensor consisting of cropped images from both data modalities. At the beginning of each training episode, the environment is initialized either randomly or with a rough alignment of the two data sources. A fixed-size window is applied to crop the depth image under the current transformation, and the cropped image is stacked with the projected CT data (Sect. 3.3) to form the output state. In the following iterations, each new action output by the agent is used to update the transformation accordingly.

Rewards: Given a state \(s_t\), a reward \(r_t\) is generated to reflect the value of the current action \(a_t\) taken by the agent. A small reward value is given to the agent during regular exploration steps, while the terminal state triggers a much larger reward. The sign of the reward is determined by whether the current distance to the ground truth has decreased compared to the previous step.
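To make the state and reward definitions concrete, the following sketch illustrates a single observe-action step. It is a minimal NumPy illustration under several assumptions the paper does not specify: the action step sizes, the termination threshold, and the cropping helper are hypothetical; only the reward magnitudes (\({\pm }0.1\) and \({\pm }10\), see Sect. 4) follow the text.

```python
import numpy as np

CROP = 200                              # fixed crop window in pixels (cf. Sect. 4)
STEP_REWARD, TERM_REWARD = 0.1, 10.0    # non-terminal / terminal reward magnitudes
TERM_DIST = 10.0                        # assumed convergence threshold (not stated in the paper)

# 6 actions: +/- translation along R and S (mm) and +/- rotation about A (deg);
# the step sizes below are illustrative assumptions.
ACTIONS = np.array([[ 5, 0, 0], [-5, 0, 0],
                    [ 0, 5, 0], [ 0, -5, 0],
                    [ 0, 0, 1], [ 0, 0, -1]], dtype=np.float32)

def crop(img, tx=0, ty=0):
    """Crop a CROP x CROP window whose center is shifted by (tx, ty) pixels."""
    cy, cx = img.shape[0] // 2 + int(ty), img.shape[1] // 2 + int(tx)
    return img[cy - CROP // 2: cy + CROP // 2, cx - CROP // 2: cx + CROP // 2]

def env_step(depth_img, ct_proj, params, action_idx, gt_params):
    """One observe-action step: update the transform, build the two-channel
    state, and compute a signed reward from the distance to the ground truth."""
    prev_dist = np.linalg.norm(params - gt_params)
    params = params + ACTIONS[action_idx]
    dist = np.linalg.norm(params - gt_params)

    # Rotation is ignored in this toy crop for brevity.
    state = np.stack([crop(depth_img, params[0], params[1]), crop(ct_proj)], axis=0)

    done = dist < TERM_DIST
    reward = (TERM_REWARD if done else STEP_REWARD) * np.sign(prev_dist - dist)
    return state, float(reward), bool(done), params
```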

Fig. 2. The derived dueling network architecture used in the proposed method.

3.2 Training the Agent

Let \(I_d\) represent the depth image and \(I_t\) the projected CT image. The goal is to estimate the rigid transformation T that aligns the moving image \(I_t\) to the fixed image \(I_d\) with minimal error. A common way to find the optimal parameters of T is to maximize a similarity function \(S(I_d, I_t)\) defined by a chosen metric. Instead of applying a manually defined metric, we adopt the reinforcement learning algorithm to learn the metric implicitly. The optimization process is recast as a Markov Decision Process following the Bellman equation [3]. More precisely, we train an agent to approximate the optimal action-value function by maximizing the cumulative future reward [13]. Different from the deep Q-network, the proposed method is derived from the Dueling Network [18] with some modifications (Fig. 2):

  • We add more convolution and pooling layers to make the network deep enough to extract high-level contextual features.

  • We add a batch normalization layer after the input data layer to minimize the effect of intensity distribution discrepancies across different modalities.

  • We concatenate the feature vector extracted from the last convolution layer with an action history vector that records the actions of the past few frames (a minimal sketch of this vector follows the list). In our experiments, concatenating the action history vector alleviates the action oscillation problem around certain image positions.
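As a sketch of the action history vector mentioned in the last item, the helper below keeps the most recent actions as a flat vector of length 24 (6 actions \(\times \) 4 histories, see Sect. 4). The one-hot encoding is our assumption; the paper does not specify how the history is encoded.

```python
import numpy as np
from collections import deque

N_ACTIONS, N_HISTORY = 6, 4   # 6 actions x 4 past steps -> length-24 vector

class ActionHistory:
    """Hypothetical helper: stores the last N_HISTORY action indices and
    exposes them as a flat one-hot vector to concatenate with conv features."""
    def __init__(self):
        self.buffer = deque([None] * N_HISTORY, maxlen=N_HISTORY)

    def push(self, action_idx):
        self.buffer.append(action_idx)

    def vector(self):
        v = np.zeros(N_ACTIONS * N_HISTORY, dtype=np.float32)
        for i, a in enumerate(self.buffer):
            if a is not None:
                v[i * N_ACTIONS + a] = 1.0
        return v
```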

The insight behind the dueling network is that certain states carry more critical information than others to help the agent make the right decision. For example, during chest region registration, having the head region rather than the arms within the observation significantly helps the agent move in the right direction. Compared to the deep Q-network, the dueling network provides separate estimates of the value and advantage functions, which allows for a better approximation of the state values. In our setup, the final Q value function is formulated as:

$$\begin{aligned} Q(s,h,a;\theta ,\alpha ,\beta ) = V(s,h;\theta ,\beta ) + \Big ( A(s,h,a;\theta ,\alpha ) - \max _{a'}A(s,h,a';\theta ,\alpha ) \Big ) \end{aligned}$$
(1)

where h is the action history vector, \(\theta \) denotes the convolution layers' parameters, and \(\alpha \) and \(\beta \) are the parameters of the two streams of fully-connected layers. To further stabilize the training process, double DQN [8] is also adopted to update the network weights.
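As an illustration of Eq. (1) and of the modifications listed above, a minimal PyTorch sketch of the network is shown below. The paper describes the architecture only at a high level (Fig. 2), so the layer counts, channel sizes, and fully-connected widths here are assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network with batch normalization after the input layer and
    an action history vector concatenated to the convolutional features."""
    def __init__(self, n_actions=6, history_len=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(2),                              # 2-channel input: depth + projected CT
            nn.Conv2d(2, 32, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        feat_dim = 64 * 4 * 4 + history_len                 # conv features + action history
        self.value = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, state, history):
        x = torch.cat([self.features(state), history], dim=1)
        v, a = self.value(x), self.advantage(x)
        return v + (a - a.max(dim=1, keepdim=True).values)  # Eq. (1)
```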

3.3 Data Projection

The two data modalities in our scenario are the 2.5D depth image and the 3D CT volume. One way to align the two modalities is to reconstruct the depth image into a 3D surface and then apply the registration algorithm in 3D space. However, feature learning with 3D convolutions requires tremendous computation. Meanwhile, the DRL algorithm with a greedy exploration policy has to explore millions of observations to properly train an agent. To reduce the computational complexity and speed up training, we reformulate the 2.5D-3D registration problem as a 2D image registration problem by simplifying the 3D volume data to a 2D image through a projection process. Note that this simplification is only for speedup purposes, and the proposed workflow can be extended to 2.5D-3D registration with minor modifications.

To best utilize the internal information that CT data naturally captures, we project the CT volume to a 2D image using the following equation:

$$\begin{aligned} I_t(x, y) = \frac{1}{h} \sum _{z=0}^{h} CT(x, y, z) \end{aligned}$$
(2)

where h is the size of the CT volume along the anterior axis. The intensity of each pixel in the projected image is the normalized sum of the voxel readings along the projection path. We apply an orthographic projection to both the depth data and the volume data; Fig. 3 shows an example of the projected images. The projected image of the volume data is visually similar to a topogram. Since medical scans often cover only a partial view of the patient, it is challenging even for a human expert to align the two modalities from the surface alone, especially over flat regions such as the chest and the abdomen. In contrast, the topogram-like image reveals more contextual information about the internal structures of the patient and thus handles the data ambiguity problem better than the surface representation.
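A minimal sketch of the projection in Eq. (2) is given below, assuming the CT volume is stored as a NumPy array whose last axis is the anterior axis (the axis convention is our assumption):

```python
import numpy as np

def project_ct(ct_volume: np.ndarray) -> np.ndarray:
    """Orthographic projection along the anterior axis (Eq. 2): each pixel of
    the topogram-like image is the normalized sum of voxels along that axis."""
    h = ct_volume.shape[-1]
    return ct_volume.sum(axis=-1) / h
```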

Although the depth-CT registration involves a six degree-of-freedom transformation, we simplify the search space to two translations, \(T_R\) (along the Right axis in the RAS coordinate system) and \(T_S\) (along the Superior axis), and one rotation, \(R_A\) (about the Anterior axis). The remaining degrees of freedom can be determined through the sensor calibration process together with the depth sensor readings. For example, the relative translation offset along the Anterior axis can be calculated by subtracting the actual table-to-camera distance from the distance recorded at calibration time.
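For illustration, the reduced search space can be written as a 2D rigid transform in the projected image plane; the matrix convention below is an assumption, not the paper's implementation:

```python
import numpy as np

def rigid_2d(t_r_mm: float, t_s_mm: float, r_a_deg: float) -> np.ndarray:
    """Homogeneous 2D rigid transform for the reduced search space: translations
    along the Right and Superior axes and a rotation about the Anterior axis."""
    c, s = np.cos(np.radians(r_a_deg)), np.sin(np.radians(r_a_deg))
    return np.array([[c, -s, t_r_mm],
                     [s,  c, t_s_mm],
                     [0.0, 0.0, 1.0]])
```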

Fig. 3. Orthographically projected CT and depth images. The left image shows a CT abdomen scan at a larger scale; the middle image shows a depth image rendered in color; the right image displays the overlay of the two modalities under the ground-truth alignment.

4 Experiments and Results

We installed Microsoft Kinect2 cameras on the ceilings of clinical CT-scan rooms. Depth images were collected while the patient lay down on the table and adjusted the pose for the scan; several snapshots were taken during the positioning process. We reconstruct each depth image into a 3D point cloud and orthographically re-project the point cloud to a 2D image. We also reconstruct the patient's CT data with the full field of view to avoid cropping artifacts. The two imaging systems, Kinect2 and the CT scanner, are pre-calibrated through a standard extrinsic calibration process [12]. As long as the patient remains stationary between the two imaging processes, the ground-truth alignment of the two data modalities can be determined from the table movement offsets and the extrinsic parameters.
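The depth pre-processing described above could look roughly as follows; this is a sketch assuming a pinhole camera model, where the intrinsics (fx, fy, cx, cy) would come from the camera calibration [12] and the grid resolution and size follow the 5 mm, \(200 \times 200\) setup described below.

```python
import numpy as np

def depth_to_pointcloud(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (in mm) to camera-space 3D points."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def orthographic_reproject(points, pixel_mm=5.0, size=200):
    """Orthographically re-project the point cloud onto a 2D grid, keeping the
    nearest depth per pixel; empty pixels are set to zero."""
    img = np.full((size, size), np.inf, dtype=np.float32)
    cols = np.clip((points[:, 0] / pixel_mm + size / 2).astype(int), 0, size - 1)
    rows = np.clip((points[:, 1] / pixel_mm + size / 2).astype(int), 0, size - 1)
    np.minimum.at(img, (rows, cols), points[:, 2])
    img[np.isinf(img)] = 0.0
    return img
```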

We collected two datasets consisting of thorax and abdomen/pelvis scans, yielding 1788 depth-CT pairs across several clinical sites. For each experiment, we randomly split the data into training and testing sets, ensuring that each training set contains 800 pairs; the rest are used for testing. We also add random perturbations to the training data to avoid overfitting.

The network configuration is shown in Fig. 2. The input images are cropped to the same size (\(200 \times 200\)) at a resolution of 5 mm. The network output is a 6D vector of action values (4 translation actions and 2 rotation actions). The action history vector has a length of 24 (6 actions \(\times \) 4 histories). We use the RMSprop optimizer without momentum to update the network weights. The learning rate is initially set to 0.00002 with a decay of 0.95 every 10,000 iterations. The mini-batch size is 32 and \(\gamma \) is set to 0.9. To start training the agent, we randomly initialize the transformation with a translation offset of \({\pm } 500\) mm and a rotation offset of \({\pm } 30^\circ \) from the ground-truth location. The non-terminal rewards are \({\pm } 0.1\) and the terminal rewards are \({\pm } 10\). For each dataset, we train an agent on a single TitanX Pascal GPU for 1.2M iterations; each training run lasts about 4 days.
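To make the training configuration concrete, the sketch below sets up the optimizer, learning-rate decay, and double-DQN target with the hyperparameters listed above. It reuses the DuelingQNet sketch from Sect. 3.2; the use of PyTorch and the exact update code are our assumptions, since the paper does not describe the implementation.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

GAMMA = 0.9

# DuelingQNet: the network sketched in Sect. 3.2; the target net is a synced copy.
policy_net, target_net = DuelingQNet(), DuelingQNet()
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.RMSprop(policy_net.parameters(), lr=2e-5, momentum=0.0)
scheduler = StepLR(optimizer, step_size=10_000, gamma=0.95)  # decay 0.95 every 10k iterations

def ddqn_target(reward, next_state, next_hist, done):
    """Double-DQN target: the action is chosen by the policy network but
    evaluated by the target network [8]."""
    with torch.no_grad():
        best_a = policy_net(next_state, next_hist).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state, next_hist).gather(1, best_a).squeeze(1)
    return reward + GAMMA * next_q * (1.0 - done.float())
```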

System performance is reported as the average Euclidean distance between the network estimation and the ground truth. We compare the performance with several baseline approaches as well as different DRL networks. The landmark baseline [6] trains detectors to detect surface landmarks, such as the shoulders and pelvis, to align with the CT anatomical landmarks. The Hausdorff baseline minimizes the surface distance between CT and depth in 3D using the Hausdorff metric. The ICP baseline aligns the two surfaces with the standard ICP algorithm. The DQN baseline is configured with the original setup [13]. The Dueling Network [18] is similar to our proposed method but configured with the original setup. We also test the proposed network without the history information and without batch normalization [9] separately. Table 1 shows the quantitative accuracy comparison among all methods as well as the computation times. A qualitative analysis of the results generated by the proposed method is shown in Fig. 4.

Table 1. Comparison of results on the thorax and abdomen (ABD) datasets.
Fig. 4. Qualitative impression of the proposed algorithm. The left image shows a perfect thorax alignment. The middle image shows a good thorax alignment even though the patient's poses at the two imaging times were different. The right image shows a perfect abdomen alignment.

5 Conclusion and Future Work

A novel depth-CT registration method based on deep reinforcement learning is proposed. Our approach investigates the correlations between surface readings from depth sensors and the internal body structures captured by CT imaging. The experimental results demonstrate that our approach achieves the best accuracy with the least deviation. The better performance compared to the two original DRL methods suggests that our modifications improve the network learning for multimodal registration. The higher errors in the abdomen cases, compared to the chest cases, may be caused by larger appearance variations. The proposed approach is not limited to depth-CT data and can also be applied to register images from other modalities. Future research directions include combining the surface metric with the contextual information to further improve performance. Extra effort is also required to improve training and testing efficiency.