Keywords

1 Introduction

Image registration is a crucial task in medical image analysis for organ motion compensation, longitudinal tumor follow-up, pre- and post-operative image matching or cohort analysis [1]. When coupled with a segmentation task through label transmission, it can usefully reduce the number of expert interventions required to accurately segment anatomical or pathological structures, especially in pre-operative surgery planning. An alternative to avoid performing costly dense registration on highly-resolved 3D images is to reach semi-dense registration by finding correspondences between structural entities. In this context, the use of supervoxel over-segmentation techniques followed by supervoxel matching have been recently introduced [2,3,4]. Supervoxel over-segmentation methods such as Simple Linear Iterative Clustering (SLIC) [5] provide reliable support regions by grouping adjacent voxels into intensity-homogeneous regions. While preserving image geometry and object contours, supervoxels are able to drag dense - millions to billions of voxels - registration problems to semi-dense - thousand of supervoxels - tractable problems, bringing computational and memory gains.

This paper focuses on semi-dense pairwise supervoxel-based registration, which consists in establishing pairings between supervoxels decomposing two 3D images. Matching can be performed by aggregating voxel-wise labels within supervoxel boundaries of the second image, based on the results of a classification algorithm trained on the first image [2, 3] or by direct nearest-neighbor search [4, 6]. Those algorithms all depend on intensity similarity features which can be easily corrupted by uniform areas, appearance changes or different noise levels between images. Thus, the need for other multivariate and invariant representations of supervoxels arises naturally. In this paper, we focus on spectral features which have proven to be robust in dense registration problems [7].

Spectral graph methods are a variety of graph-based techniques, commonly used in dimension reduction as for Laplacian eigenmaps [8], whose goal is to find a non-linear and low-dimensional representation of high-dimensional data. Conversely, we aim at obtaining a high-dimensional supervoxel representation from 4D data arising when considering supervoxel averaged intensities and centroid positions. The spectral representation achieved at the supervoxel level relies on spectral graph decomposition followed by spectrum rearrangement. Since such features exhibit smooth spatial variations, we additionally introduce context-rich features extended from voxels [2, 3, 9] to supervoxels to describe the broadened spatial context on spectral representations. Accurate supervoxel correspondences are finally obtained by nearest neighbor search on these mid-level features.

Our approach is evaluated against state-of-the-art methods for semi-dense longitudinal registration of abdominal CT scans which remains an open issue due to wide organ size, shape and appearance heterogeneity.

2 Problem Formulation

Automatic semi-dense image registration is considered between two 3D images \(I_{f}\) and \(I_{s}\). Let \(\mathcal {F}=\{f_{i}\}_{i\in \{1,\ldots ,|\mathcal {F}|\}}\) and \(\mathcal {S}=\{s_{j}\}_{j\in \{1,\ldots ,|\mathcal {S}|\}}\) be respectively the set of \(|\mathcal {F}|\) and \(|\mathcal {S}|\) connected supervoxels partitioning \(I_{f}\) and \(I_{s}\) and obtained using a 3D extension of SLIC [5]. Our goal is to establish supervoxel correspondences by estimating a function h that maps each supervoxel \(f_{i}\in \mathcal {F}\) of \(I_{f}\) to a supervoxel \(s_{j}\in \mathcal {S}\) of \(I_{s}\), such that:

$$\begin{aligned} {\forall i \in \{1,\ldots ,|\mathcal {F}|\}, \exists j\in \{1,\ldots ,|\mathcal {S}|\} | h(f_{i})=s_{j}} \end{aligned}$$
(1)

We focus on supervoxel matching through straightforward nearest neighbor search at the supervoxel level. This considerably alleviates the computational and memory issues raised when a voxel-wise classification scheme [2, 3] is followed. Let \(\psi (f_{i})\) (resp. \(\psi (s_{j})\)) be the set of mid-level features assigned to \(f_{i}\) (resp. \(s_{j}\)). The optimization scheme can be easily written as:

$$\begin{aligned} {h(f_{i}) {= s_{j}} = \arg \min _{ s_{l} \in \mathcal {S}} \Vert \psi (f_{i}) - \psi (s_{l}) \Vert _{2}} \end{aligned}$$
(2)

Since intensity-based similarity features [4, 6] are not robust enough to discriminate supervoxels, alternate invariant representations of supervoxels are required. Our contributions rely on the combination of new feature categories built at the supervoxel extent: mid-level spectral and context-rich features.

3 Mid-Level Spectral Features

Spectral dimension reduction methods aim at changing the representation of high-dimensional data to a low-dimensional description with a few variables only. This embedding is expected to be as faithful to the data geometry as possible by preserving inner manifold distances in high-dimensional space, while striving to constrain the low-dimensional structure in an Euclidean space. We propose to adapt this concept in the inverse direction to provide a powerful multivariate representation of supervoxels, starting from low-dimensional 4D data arising when considering supervoxel averaged intensities and centroid positions only. In practice, we exploit for each image \(I_{f}\) and \(I_{s}\) an adjacency graph over the set of all supervoxels in order to obtain a spectral representation of chosen dimension.

3.1 Laplacian Graph Building

In the dense context [7], one usually constructs a graph where each node is a voxel and each vertex is weighted by the distance from neighboring voxels based on spatial and intensity similarities. In practice, the dense approach constrains to use only adjacency between voxels and their closest neighbors because a full matrix would be intractable from a memory and computational point of view. Conversely, we use a dense graph over the generated supervoxels. We exploit the same criteria of spatial and intensity proximity to define the adjacency graph using supervoxels as nodes and a distance between averaged positions and intensities as edges. We can thus define a dissimilarity matrix W as:

$$\begin{aligned} W_{i,j} = |X_i - X_j| / S_X + \alpha |I_i - I_j| /S_I \end{aligned}$$
(3)

where ij are indices of two supervoxels, \(S_X = \sigma (\{|X_i - X_j|\}_{i,j})\) and \(S_I = \sigma (\{|I_i - I_j|\}_{i,j})\) are respectively the standard deviation of differences in centroid positions and averaged intensities (X and I) of supervoxels. The balance between both terms is adjusted using \(\alpha \). We then define an affinity matrix A by using a Radial Basis Function (RBF) kernel over W such that: \(A = \exp (-\gamma *W)\). The kernel introduces a non-linearity in the representation, which allows to capture the non-linear modes of intensity displacements within images. Finally, the Laplacian graph is built by performing a symmetric normalization of A:

$$\begin{aligned} L = I-D^{-1/2}AD^{-1/2} \end{aligned}$$
(4)

where I is the square identity matrix of size \(|\mathcal {F}|\) for \(I_{f}\) or \(|\mathcal {S}|\) for \(I_{s}\), and D is the diagonal matrix where each diagonal coefficient corresponds to the sum of coefficients along the rows of L. By normalizing the Laplacian, we construct a Markov chain over the set of supervoxels whose eigenvectors can intuitively be interpreted as diffusion modes over the supervoxels graph.

We are now interested in computing the smallest eigenvectors of L. This can be done with any power-method based algorithm [10] for efficiency or by computing the full spectrum of L and keeping only the smallest eigenvalues. Since \(|\mathcal {F}|\) and \(|\mathcal {S}|\) are usually small, the computational time for the full spectrum is, in practice, in the order of a few seconds. While we lose resolution by aggregating voxels together, we consider the adjacency between all supervoxels, and not only the closest ones as imposed in dense scenarios. This allows to better capture global as well as local modes of intensity diffusion within images.

3.2 Rearrangement of the Spectra

The graph building procedure is applied to \(I_{f}\) and \(I_{s}\). Since we compute eigenvectors of two different matrices and expect them to be similar, we should keep in mind that they are defined up to the sign and to a positive multiplicative constant. To have meaningful correspondences of eigenvectors between two Laplacian matrices, we need to ensure that they match adequately. A second problem is related to numerical issues: eigenvalues corresponding to the same diffusion modes in the two images are usually not sorted in the same way. Thus, corresponding eigenvectors between two Laplacians are defined up to a permutation.

Fig. 1.
figure 1

First three eigenvectors with eigenvalue in increasing order without (a) and with (b) permutations and sign flips (supervoxel graphs built on two close CT scans).

Figure 1a presents an example of supervoxel spectral features without matching correction. We can see matching errors due to a permutation and a sign flip. To alleviate both problems, we spread the eigenvectors to \([-1,1]\) by histogram equalization to eliminate the multiplicative problem and ensure that eigenvectors are globally discriminative. One can then either perform linear assignment with the Hungarian algorithm [11] over the joint sets of eigenvectors and their opposites or maximize the correlation in absolute value between eigenvectors, thus finding iteratively the correspondence and the sign for each one. Figure 1b shows how the Hungarian algorithm can correct these errors.

Graph decomposition and rearrangement lead to a N-dimensional feature vector \(\varPhi (f_{i})=\{\varPhi _{n}(f_{i})\}_{n\in \{1,\ldots ,N\}}\) for each \(f_{i}\in \mathcal {F}\) where \(\varPhi _{n}(f_{i})\) is the i-th element of the n-th eigenvector built from \(I_{f}\). The same representation \(\varPhi (s_{j})\) is adopted for each \(s_{j}\in \mathcal {S}\) based on eigenvectors generated from \(I_{s}\) and belonging to \(\mathbb {R}^{N\times |\mathcal {S}|}\).

4 Mid-Level Context-Rich Features

The mid-level spectral representation leads to smoothly varying spatial features (Fig. 1). We thus supplement spectral features by introducing context-rich features extended from voxels [2, 3, 9] to supervoxels. We do not estimate mid-level context-rich features on source images [9] as usually performed but extract them on spectral features to describe their broadened spatial context.

Each supervoxel \(f_{i}\) of \(\mathcal {F}\) is described by its barycenter \(c(f_{i})\) and its mid-level spectral features \(\varPhi _{n}(f_{i})\). Following [4] and similarly to pixel patches, we exploit a supervoxel patch structure called superpatch. A superpatch \(P_{r}(f_{i})\) of radius r centered on \(f_{i}\) aggregates supervoxels \(f_{l}\in \mathcal {F}\) such that \(\left\| c(f_{l})-c(f_{i}) \right\| _{2} \le r\). Thus, the superpatch \(P_{r}(f_{i})\) is built by considering all supervoxels located within a fixed radius r with respect to \(f_{i}\) and contains necessarily \(f_{i}\) at least. Spectral features for each superpatch can be estimated by averaging spectral features within its boundary \(\varPhi _{n}[P_{r}(f_{i})]=\frac{1}{|P_{r}(f_{i})|} \sum _{f_{l} \in P_{r}(f_{i})} \varPhi _{n}(f_{l})\).

The superpatch concept allows a straightforward extension of context-rich features from voxels [2, 3, 9] to supervoxels. In particular, for each eigenvector of index \(n\in \{1,...,N\}\), we can assign M mid-level context-rich appearance features \(\psi (f_{i},n)=\{\psi _{m}(f_{i},n)\}_{m\in \{0,\ldots ,M\}}\) to each supervoxel \(f_{i}\) following:

$$\begin{aligned} \psi _{m}(f_{i},n) =\varPhi _{n}[P_{r}(\mathcal {N}_{\varDelta }(f_{i}))] - b \times \varPhi _{n}[P_{r'}(\mathcal {N}_{\varDelta '}(f_{i}))] \end{aligned}$$
(5)

where \(\mathcal {N}_{\varDelta }(f_{i})\) defines an extended supervoxel neighbor of \(f_{i}\) reached by applying displacement \(\varDelta \) starting from barycenter \(c(f_{i})\). Thus, \(\mathcal {N}_{\varDelta }(f_{i})\) corresponds to the supervoxel containing location \(c(f_{i})+\varDelta \). Displacements \(\{\varDelta ,\varDelta '\}\) are randomly defined in a ball of maximal radius \(\varepsilon \). Radii \(\{r,r'\}\) are computed following \(r=q\times \frac{\kappa }{2}\) where \(\kappa \) is an averaged inter-barycenter distance (estimated among all adjacent supervoxels) and q designates an integer randomly generated within \(\{1,2,...,Q\}\). \(b\in \{0,1\}\) is a binary parameter which selects whether the average spectral feature differences between two superpatches randomly located in the extended neighborhood of \(f_{i}\) (\(b=1\)) or the value obtained from one single superpatch \(P_{r}(\mathcal {N}_{\varDelta }(f_{i}))\) only (\(b=0\)). Averaged spectral features around \(f_{i}\) are included in the feature vector \(\psi (f_{i},n)\) by forcing \(\varDelta =b=0\) for all possible radii r.

By randomly generating many different radii r and offsets \(\varDelta \), we obtain M features describing the extended spatial context at a mid-level extend for spectral representations corresponding to the n-th eigenvalue. For each eigenvalue, \(\{\varDelta ,\varDelta ',r,r',b\}\) are randomly generated once and remain similar for each supervoxel and each image. Conversely, these parameters differ from one eigenvalue to another. The combination of spectral and context-rich features leads to \(N\times M\) features \(\psi (\cdot )=[\psi (\cdot ,1),\psi (\cdot ,2),...,\psi (\cdot ,N)]\in \mathbb {R}^{N\times M}\) assigned to each supervoxel.

5 Results

Experiments focus on data collected from 45 images pairs from 18 examinations, stemming from 7 patients with hepato-cellular carcinoma. Each pair brings together two dynamic contrast-enhanced (DCE-)CT scans acquired for the same patient and the same phase (before injection, arterial, early venous or late venous), at different time points varying from 41 to 700 days (226 in average). Image pairs are processed both forward (FW) and backward (BW) to get mapping functions \(h_{\texttt {FW}}(f_{i})=s_{j}\) and \(h_{\texttt {BW}}(s_{j})=f_{i}\) for each supervoxel \(f_{i}\in \mathcal {F}\) and \(s_{j}\in \mathcal {S}\).

Since supervoxel matching allows a straightforward propagation of anatomical labels, we evaluate the proposed methodology on liver label propagation results using DICE scores comparing liver propagation and ground-truth (GT) masks. Supervoxel pairings are also assessed using FW/BW inconsistency scores defined for each supervoxel similarly to \(\mathtt {inc}(f_{i}) = \left\| c(f_{i}) - c(h_{\mathtt {BW}}(h_{\mathtt {FW}}(f_{i})))\right\| _{2}.\)

The dataset is used to compare the proposed supervoxel matching combining mid-level spectral and context-rich features (sm-SC) with baseline techniques including supervoxel matching through mid-level intensity histogram (sm-I), spectral (sm-S) and intensity-based context-rich features (sm-IC) only, state-of-the-art SuperPatchMatch [4] (spm-I) and unsupervised learning-based voxel-to-supervoxel mapping using pixel-wise intensity-based context-rich features [2] (uvm-C). Supervoxel pairings for sm-{I,S,IC,SC} are established through nearest neighbor search contrary to uvm-C which employs voxel-wise random forests (RF) [12]. sm-I and spm-I use 30 bins intensity histograms. Strategies based on spectral features (sm-{S,SC}) use \(N=17\) eigenvectors. The total feature number for sm-{C,SC} and uvm-C is 600 (595 for sm-SC with \(N=17\) and \(M=35\)). Affinity matrices for sm-SC are computed with \(\alpha =0.1\) (Eq. 3) and \(\gamma =0.1\). Superpatch radii are estimated with \(Q=4\) for sm-SC and \(r=\frac{3\times \kappa }{2}\) for spm-I (Sect. 4) whereas averaged intensities for usm-C are considered with box sizes {3,5,7,9}. spm-I employs 6 iterations including propagation and random search. The maximal ball radius used to define extended neighbors for {sm-SC,usm-C} is \(\varepsilon =125\).

Partitions are made of \(|\mathcal {F}|=|\mathcal {S}|=2000\) supervoxels defined within areas with intensities in the \([-200,200]\) Hounsfield unit range. A SLIC [5] compactness of 0.2 with intensities rescaled to [0, 1] reaches a good trade-off between compactness and boundary adherence. Methods are applied with two SLIC decompositions: without and with a-priori liver segmentation awareness to evaluate supervoxel matches without being disturbed by boundary adherence issues. With a-priori, decompositions are separately performed for liver and other abdominal structures depending on relative volumes and merged to obtain one single partition.

Table 1. Quantitative comparisons of the proposed supervoxel matching combining mid-level spectral and context-rich features (sm-SC) with baseline techniques (sm-I, sm-S, sm-IC, spm-I [4], uvm-C [2]) through liver label propagation (DICE) and inconsistency (inc) averaged over the database (FW and BW). Best results are in bold.
Fig. 2.
figure 2

Assessment of sm-{S,SC} with supervoxel decomposition and pairings, GT and propagated liver mask (without a-priori).

Average DICE and inc scores, displayed in Table 1, demonstrate that sm-SC achieves the highest segmentation accuracy along with the lowest inconsistency in both configurations. Without liver segmentation awareness, significant gains in DICE (inc) arise with 86.28 (7.46) against 83.26 (8.01) and 81.83 (15.89) for sm-S and uvm-C respectively. More substantial gains arise using a-priori with DICE improvements of 6.7 (6.3) with respect to sm-S (uvm-C). sm-{S,SC} outperforms uvm-C whose computational and memory requirements are much higher since voxel-wise RF are used. Results show that intensity features ({sm-I,spm-I}) are not robust enough to discriminate supervoxels without context-rich or spectral descriptions. Compared to sm-S, applying context-rich features on spectral representations (sm-SC) reaches gains around 6.7 (3.0) and 0.9 (0.55) in DICE and inc with (without) a-priori. Comparing sm-IC and sm-SC reveals that context-rich features are more efficient on spectral than CT intensity data. The significant impact of the newly designed mid-level features are highlighted in Fig. 2 illustrating the accuracy of both supervoxel matching and liver propagation.

6 Conclusion

This work addresses automatic semi-dense image registration relying on robust supervoxel matching. We propose new multivariate supervoxel representations embedded in nearest neighbor search. Feature extraction is performed at the supervoxel level and combines mid-level spectral and context-rich information. Spectral graph decomposition and spectrum rearrangement are involved to capture global modes of intensity diffusion within supervoxel adjacency graphs. Context-rich features extended from voxels to supervoxels are employed to describe the extended spatial context on resulting spectral representations. Experiments on liver label propagation between CT image pairs show that our strategy outperforms state-of-the-art methods while bringing computational and memory gains. Extending this work to multi-scale supervoxel decomposition would deserve further investigation to drive the matching process in a coarse-to-fine fashion. This work also gives new insights for registration initialization, towards more complex dense deformation model estimation. Finally, we plan to explore spectral supervoxel matching for multi-phases and multi-modal registration.