
1 Introduction

Segmenting ribs in chest radiographs supports the analysis of the lung parenchyma, as the overlaid ribs may obscure important findings. Rib shadows may either be excluded from the automatic analysis [1] or suppressed from the image [2] to minimize their impact. Ribs may also serve as an anatomical reference to automatically locate findings such as lung lesions or to establish correspondence between different images (e.g., in follow-up acquisitions). Furthermore, counting the ribs in the lung field is a standard radiological procedure to assure a proper inhalation state in chest X-ray quality assessment. Unlike the first two, the latter applications require the ribs not only to be segmented, but also to be labeled with the correct anatomical name.

A number of methods have been described in the literature to segment ribs in chest radiographs using pixel classification [3], atlas registration [1], or a mixture of methods [4]. However, no method described so far performs robust anatomical labeling of the posterior ribs; even the atlas-based method did not use rib labels. Unlike in CT, where this task can be solved more easily [5, 6], the upper ribs in a chest radiograph are often overlaid (e.g., by clavicles and other ribs) in a way that may prevent an algorithm from identifying and counting all of them properly. Using the lung field as a reference space also appears insufficient to unambiguously assign an anatomical label to a detected rib.

In this paper, we propose an automatic and general approach for localizing and labeling spatially correlated point landmarks. We apply it specifically to rib localization and labeling in posterior-anterior chest radiographs by formulating the problem as finding a key point on each rib near the rib center. Unlike previous methods, our approach is generic and makes no task-specific assumptions; instead, all model parameters are automatically learned from annotated training data. First, the fully convolutional network (FCN) U-Net [7] is used to generate localization hypotheses. Then, a conditional random field (CRF) is applied to assess spatial information between landmarks. For feasibility, the CRF state space is combinatorially defined by the U-Net-generated localization hypotheses. Since the CRF has no means to select positions other than these localization hypotheses, we introduce a novel “refine” label, which the CRF can select instead of any of the localization hypotheses when, e.g., none of them presents a viable option w.r.t. the CRF model. A second inference is then performed for all “refine”-labeled nodes on a local subgraph over all image pixels rather than the set of localization hypotheses. Applying our approach to 642 images of the publicly available Indiana chest X-ray collection [8], we are able to localize and label 94.6% of the 16 individual landmarks correctly, corresponding to 83.0% fully correct cases. A median distance to the rib centerline of 0.7 mm is achieved.

2 Method

We formulate the problem as predicting \(N = 16\) labeled key points, one for each posterior rib (2nd to 9th, left and right) close to its centerline (see Fig. 3a) in posterior-anterior chest radiographs. Our approach is split into three steps (compare Fig. 1): First, an FCN is used to regress heat maps from which \(n = 15\) localization hypotheses \(\mathcal {\hat{X}}_i = \{{\mathbf {\hat{x}}}_{i,1}, \ldots , {\mathbf {\hat{x}}}_{i,n}\}\) are derived for each key point (Fig. 1a). Second, the unary information of the localizer is combined with binary information assessing spatial features between key point localization hypotheses. Both are jointly modeled in a CRF with key points being the nodes and the corresponding localization hypotheses the respective states. An additional “refine” label is introduced for each node, to be selected if no localization hypothesis is plausible (given the CRF model). CRF A* inference [9] is applied to find the best selection out of all possible selections \(|\mathcal {S}|= (n + 1)^N\). For each key point \(i\), the inference either selects a localization hypothesis \({\mathbf {\hat{x}}}_{i,\hat{s}_i}\) if \(\hat{s}_i > 0\), or the “refine” label if \(\hat{s}_i = 0\) (Fig. 1b). In the third step, we derive positions for the “refine”-labeled nodes: we fix all nodes with predicted positions and optimize each “refine” node in a small subgraph over all image pixels rather than over the \(n\) localization hypotheses only (Fig. 1c). The following three subsections describe each step in detail.

Fig. 1.
figure 1

Schematic illustration of our three-step approach: (a) Generation of localization hypotheses using a U-Net, (b) followed by a CRF modelling spatial relations between key points and (c) a final local refinement based on a subgraph considering the whole image domain.

2.1 Generating Localization Hypotheses Using a U-Net

The goal of the first step is to predict candidate positions for each key point. The basic idea is to transform an image \({\mathbf {I}} : \mathbb {R}^2 \rightarrow \mathbb {R}\) into pseudo (not normalized) probability maps \(\widetilde{\mathbf {P}}_i : \mathbb {R}^2 \rightarrow \mathbb {R}^{+}\) (for each target key point \(i\)) in which the location of the highest value \(\hat{\mathbf {x}}_{i,1} = \mathop {\mathrm {arg\,max}}\nolimits _\mathbf {x} \, \widetilde{\mathbf {P}}_i(\mathbf {x})\) corresponds to the most likely predicted position of key point \(i\).

To do so, we use the U-Net [7], which has proven to deliver good results in the medical domain. However, its architecture is designed for pixel-wise segmentation, while we aim at localizing points in order to combine the network with a CRF. Therefore, we directly formulate the problem as pixel-wise heat map regression. This is achieved by dropping the soft-max classification and extending the final layer to \(N\) feature maps, each corresponding to one key-point-specific heat map that we want to regress. Assuming that high values in these heat maps correspond to likely positions of the searched key point, we can simply apply non-maximum suppression to each heat map to generate \(n\) localization hypotheses \(\mathcal {\hat{X}}_i = \{{\mathbf {\hat{x}}}_{i,1}, \ldots , {\mathbf {\hat{x}}}_{i,n}\}\) (we use \(n = 15\)) for each key point \(i\). This setup allows generating localization hypotheses jointly for all key points in a single network.
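The heat-map-to-hypotheses step can be sketched as a minimal non-maximum suppression; the suppression radius is an illustrative assumption (here reused from the training target, \(r = 6\)), as the paper does not fix it explicitly:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def localization_hypotheses(heat_map, n=15, radius=6):
    """Extract the n strongest local maxima of a key-point heat map.

    A pixel is kept if it equals the maximum of its (2*radius+1)^2
    neighborhood (non-maximum suppression); the kept pixels are then
    ranked by heat-map value and the top n returned as hypotheses.
    """
    local_max = heat_map == maximum_filter(heat_map, size=2 * radius + 1)
    ys, xs = np.nonzero(local_max)
    order = np.argsort(heat_map[ys, xs])[::-1][:n]
    return [(int(ys[k]), int(xs[k])) for k in order]
```

Running this once per key-point channel of the network output yields the \(N\) hypothesis sets \(\mathcal {\hat{X}}_i\).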

Training. The modified U-Net is trained using stochastic gradient descent in the form of the Adam [10] algorithm using standard parameters with a mini-batch size of 8 and a sum-squared-error loss function. The target regression values are defined by a multivariate Gaussian distribution \(\mathcal {N}({\mathbf {x}}_i^*, \nicefrac {1}{9} {\mathbf {1}} r^2)\) with its mean located at the key point’s true position \({\mathbf {x}}_i^*\). This provides high values close to the true position and very low values outside a small neighborhood (\(r = 6\)). As advocated in [7], we also perform data augmentation in form of elastic transformations of the training images, effectively increasing the training set size by a factor of 11. We stop the training after 1000 epochs.
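The Gaussian regression target described above can be written out as a short sketch; the unnormalized peak height of 1 is an illustrative assumption, the variance \(r^2/9\) follows the covariance \(\nicefrac {1}{9}\,{\mathbf {1}} r^2\) given in the text:

```python
import numpy as np

def gaussian_target(shape, center, r=6.0):
    """Target heat map for one key point: an unnormalized isotropic
    Gaussian with mean at the true position and variance r^2 / 9,
    i.e. value 1 at the key point and ~exp(-4.5) at distance r."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * r ** 2 / 9.0))
```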

2.2 Selecting Reasonable Localization Hypotheses Using a CRF

To compensate for potentially incorrect first best localization hypotheses \({\mathbf {\hat{x}}}_{i,1}\) for arbitrary key points \(i\), we use a CRF to model geometric relationships between key points. Each key point is represented by a node in the graph with the corresponding localization hypotheses \(\mathcal {\hat{X}}_i\) being the respective labels. We introduce an additional “refine” label for the CRF to choose during inference to compensate for cases where none of the localization hypotheses is plausible (and might negatively influence the selection of neighboring nodes). This “refine” label is used in our third step (Sect. 2.3) to still derive an accurate prediction in case CRF inference assigned the “refine” label to any node.

An energy-based formulation is applied where a low energy \(E({\mathbf {S}})\) of a selection \({\mathbf {S}} = (s_1, \ldots , s_N)\) implies a large posterior probability. For each node, either a localization hypothesis \({\mathbf {\hat{x}}}_{i,s_i}\) is assumed if \(s_i > 0\), or the “refine” label if \(s_i = 0\). The energy \(E({\mathbf {S}})\) of the CRF is parameterized by a set of \(J\) unary and binary potential functions \(\Phi = \{\phi _1(\cdot ), \ldots , \phi _J(\cdot )\}\) with corresponding weights \({\mathbf {\Lambda }} = (\lambda _1, \ldots , \lambda _J)\) and missing potential values \(\varvec{\upbeta } = (\beta _1, \ldots , \beta _J)\) for the “refine” label \(s_i = 0\):

$$\begin{aligned} \begin{aligned} E(\mathbf {S}) =&\sum _{j=1}^{J} \lambda _j \cdot \left\{ \begin{array}{ll} \beta _j &{} \, \text {if } s_i = 0 \text { for any } i \in \text {Scope}(\phi _j)\\ \phi _j(\mathbf {S}) &{} \, \text {else} \end{array}\right. \, . \end{aligned} \end{aligned}$$
(1)

The inclusion of the missing energy values \(\varvec{\upbeta }\) is necessary because it is not possible to compute potential values for the “refine” label \(s_i = 0\). We use the same four potential types as introduced in [11]: For each key point \(i\), a unary potential \(\phi _i^\text {loc}({\mathbf {S}})\) corresponding to the localizer’s respective heat map value is introduced (see Fig. 2a). For each key point pair \(i\) and \(j\), a distance potential \(\phi _{i,j}^\text {dist}({\mathbf {S}})\), an angle potential \(\phi _{i,j}^\text {ang}({\mathbf {S}})\) and a vector potential \(\phi _{i,j}^\text {vec}({\mathbf {S}})\) are used to model the geometric relations. The probability densities of estimated Gaussian, von Mises and multivariate Gaussian distributions, respectively, are used as potential values (see Fig. 2b). Finally, to efficiently find the best selection \({\mathbf {\hat{S}}} = \mathop {\mathrm {arg\,min}}\nolimits _{{\mathbf {S}} \in \mathcal {S}} E({\mathbf {S}})\), exact inference in form of the A* algorithm [9] is applied.
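The energy of Eq. (1) can be illustrated with a minimal sketch; the `(scope, fn)` representation of potentials is a hypothetical interface for illustration, not the authors' implementation:

```python
def crf_energy(selection, potentials, lambdas, betas):
    """Energy E(S) of Eq. (1): each potential phi_j contributes its
    weighted value lambda_j * phi_j(S), or the learned constant
    lambda_j * beta_j if any key point in its scope carries the
    "refine" label (s_i = 0), for which no potential value exists.

    `potentials` is a list of (scope, fn) pairs, where `scope` holds
    the key-point indices the potential depends on and `fn` maps the
    selection to the potential value phi_j(S)."""
    energy = 0.0
    for (scope, phi), lam, beta in zip(potentials, lambdas, betas):
        if any(selection[i] == 0 for i in scope):
            energy += lam * beta
        else:
            energy += lam * phi(selection)
    return energy
```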

Training. The weights \({\mathbf {\Lambda }}\) and the missing energies \(\varvec{\upbeta }\) are automatically learned in a training phase using a gradient descent scheme minimizing a max-margin hinge loss \(L\) over data \(\mathcal {D} = \{{\mathbf {d}}_1, \ldots , {\mathbf {d}}_K\}\). The idea is to increase the energy gap between the “correct” selection \({\mathbf {S}}^+\) and the best (lowest energy) “incorrect” selection \({\mathbf {S}}^-\) until a certain margin \(m\) is satisfied. Let our loss function be defined as

$$\begin{aligned} L({\mathbf {\Lambda }}, {\varvec{\upbeta }}) = \frac{1}{K} \sum _{k=1}^K \delta (\mathbf {S}_k^+, \mathbf {S}_k^-) \cdot \max \left( 0, m + E(\mathbf {S}_k^+) - E(\mathbf {S}_k^-) \right) + \theta \cdot ||{\mathbf {\Lambda }}||_1 \end{aligned}$$
(2)

subject to \(\lambda _j \ge 0\) for \(j = 1, \ldots , J\), with

$$\begin{aligned} \delta (\mathbf {S}_k^+, \mathbf {S}_k^-) = \frac{1}{N R} \left( e(\mathbf {S}_k^-) - e(\mathbf {S}_k^+) \right) \, \in \, [0, 1] \end{aligned}$$
(3)

weighting each training sample \(k\) w.r.t. the reduction in error (capped at \(R = 100\))

$$\begin{aligned} e(\mathbf {S}) = \sum _{i = 1}^N \left\{ \begin{array}{ll} 0 &{} \text {if ``refine'' } (s_i = 0) \text { predicted and true}, \\ \min \left( R, ||\hat{{\mathbf {x}}}_{i,s_i} - {\mathbf {x}}_i^* ||_2\right) &{} \text {if ``non-refine'' } (s_i > 0) \text { predicted and true}, \\ R &{} \text {else}, \end{array}\right. \end{aligned}$$
(4)

going from the incorrect selection \({\mathbf {S}}_k^-\) to the correct selection \({\mathbf {S}}_k^+\). The “refine” label (\(s_i = 0\)) is assumed true if none of the localization hypotheses (\(s_i > 0\)) is correct (the localization criterion is defined in Sect. 3). An additional \(\theta \)-weighted L1 regularization term w.r.t. \({\mathbf {\Lambda }}\) is added to further accelerate the sparsification of terms. To optimize the loss function from Eq. (2), we again apply the Adam algorithm [10], starting from a grid-structured graph (Fig. 1b). Once all CRF parameters are estimated, we remove unnecessary potentials with \(\lambda _j = 0\), effectively optimizing the graph topology and improving the inference time while simultaneously improving the localization performance.
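The error measure of Eq. (4) and the sample weight of Eq. (3) can be sketched as follows; the data structures (per-key-point hypothesis lists, ground-truth positions, refine-truth flags) are hypothetical placeholders:

```python
import numpy as np

def selection_error(selection, hypotheses, truths, refine_true, R=100.0):
    """Error e(S) of Eq. (4): 0 for a correctly predicted "refine"
    label, the R-capped Euclidean distance to the true position for a
    correct non-refine prediction, and the full penalty R otherwise."""
    err = 0.0
    for i, s in enumerate(selection):
        if s == 0 and refine_true[i]:
            err += 0.0
        elif s > 0 and not refine_true[i]:
            dist = np.linalg.norm(np.asarray(hypotheses[i][s - 1], float)
                                  - np.asarray(truths[i], float))
            err += min(R, dist)
        else:
            err += R
    return err

def delta(e_neg, e_pos, N, R=100.0):
    """Sample weight of Eq. (3): error reduction going from the
    incorrect to the correct selection, normalized to [0, 1]."""
    return (e_neg - e_pos) / (N * R)
```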

2.3 Going Beyond Potentially Incorrect Localization Hypotheses

After finding the optimal selection \({\mathbf {\hat{S}}}\) using CRF inference, we look at all key points \(\{i \mid s_i = 0\}\) that have the “refine” label \(s_i = 0\) instead of a localization hypothesis assigned. In order to assign those nodes a position, we start by fixing all nodes \(\{i \mid s_i > 0\}\) with a properly selected localization hypothesis \({\mathbf {\hat{x}}}_{i,s_i}\). Then, we individually optimize each “refine”-labeled node by considering all connected binary potentials \(\Phi _i = \{\phi _j \mid i \in \text {Scope}(\phi _j) \, \wedge \, \exists \, i' : (i' \in \text {Scope}(\phi _j) \, \wedge \, s_{i'} > 0) \}\) that are fully specified (except for the current node). Given that this second inference

$$\begin{aligned} {\mathbf {\widetilde{x}}}_i = \mathop {\mathrm {arg\,min}}\limits _{\mathbf {x} \,\in \, \mathbf {I}} \sum _{\phi '_j \,\in \, \Phi _i} \lambda _j \cdot \phi '_j(\mathbf {x}) \end{aligned}$$
(5)

is performed on a very small subgraph, we can increase the search space for that node to all possible pixel positions \({\mathbf {x}} \in {\mathbf {I}}\), rather than the set of localization hypotheses, which would be intractable on the full problem. Note the slight abuse of notation \(\phi '_j\), indicating that the (binary) potentials are computed by solely altering the position of key point \(i\), since all other positions are known and fixed. By optimizing all “refine” nodes in decreasing order of the number of connected potentials \(|\Phi _i |\), previously refined positions become usable potentials in subsequent optimizations. This also prevents the case that a node has no usable potential at all. How this approach can overcome the limitation of the fixed state space is illustrated on a test case in Fig. 2. Note that this final step does not require any training.
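The second inference of Eq. (5) reduces to a brute-force search over the image domain; a naive sketch, in which the potential functions \(\phi '_j\) are hypothetical stand-ins already bound to the fixed positions of the other key points:

```python
import numpy as np

def refine_position(image_shape, connected_potentials, lambdas):
    """Eq. (5): argmin over all pixel positions (y, x) of the weighted
    sum of the connected binary potentials, each evaluated with every
    other key point held fixed. Feasible only because the subgraph
    around a single "refine" node is very small."""
    best, best_e = None, np.inf
    for y in range(image_shape[0]):
        for x in range(image_shape[1]):
            e = sum(lam * phi((y, x))
                    for phi, lam in zip(connected_potentials, lambdas))
            if e < best_e:
                best, best_e = (y, x), e
    return best
```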

Fig. 2.
figure 2

Illustration of the refinement process for L9 (true key point indicated as red cross). The “refine” label was chosen by the CRF inference for L9 because all \(n\) localizer hypotheses – shown enumerated in (a), heat map overlaid on cropped original image – yield large total energies \(E({\mathbf {S}})\). Utilizing the connected (b) binary potentials from L8 and R9, we are still able to (c) predict a correct position (red point) for L9 by evaluating over all image pixels instead of just the (here incorrect) localization hypotheses.

3 Experiments and Results

We evaluated our approach on \({1000}\) consecutive images of various quality from the publicly available anonymized Indiana chest X-ray collection of the U.S. National Library of Medicine [8], downsampled to an isotropic resolution of 1\(\,\times \,\)1 mm/px. To derive key points for the unlabeled images for training and evaluation, we started by generating unlabeled posterior rib centerlines using an automatic approach based on [4]. The generated centerlines were then manually checked for quality, i.e., whether the line is properly located within the rib and correctly labeled, potentially discarding images. Following this approach, we obtained labeled centerlines for the posterior ribs \(\text {L2},\ldots ,\text {L9}, \text {R2},\ldots ,\text {R9}\) for 642 images. The middle point on each centerline (w.r.t. the x-axis) was selected as point annotation for the corresponding key point (except for the second and third ribs, where a factor of \(\pm 0.3\) and \(\pm 0.4\), respectively, instead of 0.5 was chosen). A predicted point is treated as correct (localization and labeling criterion) if it is close to the annotated point (distance \(\le {15\,mm}\)) and very close to the centerline (distance \(\le {7.5\,mm}\)). This resembles a test of whether the point lies on the rib while allowing for some translation along the rib. An example annotation as well as this localization criterion are depicted in Fig. 3a. Note that correct point localizations also imply correct labels for the previously generated unlabeled centerlines, which effectively yields automatically generated labeled centerlines as well.
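The localization criterion can be expressed compactly; note that the point-to-centerline distance is approximated here by the distance to densely sampled centerline points, which is an assumption for illustration:

```python
import numpy as np

def is_correct(pred, annotated_point, centerline, d_point=15.0, d_line=7.5):
    """Localization and labeling criterion: a prediction counts as
    correct if it lies within d_point mm of the annotated key point
    and within d_line mm of the rib centerline (the latter measured
    against densely sampled centerline points, in mm coordinates)."""
    pred = np.asarray(pred, dtype=float)
    if np.linalg.norm(pred - np.asarray(annotated_point, dtype=float)) > d_point:
        return False
    dists = np.linalg.norm(np.asarray(centerline, dtype=float) - pred, axis=1)
    return bool(dists.min() <= d_line)
```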

Fig. 3.
figure 3

(a) Illustration of the centerline and key point annotations and the resulting localization criterion, i.e., the area where a localization hypothesis is considered correct. Images in (a) and (b) are cropped and have been enhanced using adaptive histogram equalization. (b) Predicted positions in 642 test images visualized in a single image by registering the images using the true positions (affine and b-spline) and warping the predicted positions. Labels are shown color coded. (c) Typical errors involve an incorrect localization in the abdomen (first image) and chain errors caused by intermediate mistakes (second image).

We used a 3-fold cross-validation setup in our experiments, which provided us with 428 training images in each fold. Each training corpus was divided into three non-overlapping subsets \(\mathcal {D}_\text {pot}\), \(\mathcal {D}_\text {weights}\) and \(\mathcal {D}_\text {val}\), containing 50%, 40%, and 10% of the training images (randomly selected), respectively. \(\mathcal {D}_\text {pot}\) was used to train the localizer (Sect. 2.1) and to estimate the statistics of the CRF potential functions (means, variances). \(\mathcal {D}_\text {weights}\) was used to optimize the CRF potential weights \({\mathbf {\Lambda }}\) and to estimate the missing energies \(\varvec{\upbeta }\). The last subset \(\mathcal {D}_\text {val}\) was used as validation corpus to select unknown meta parameters like the learning rate and the regularization parameter \(\theta \).

Applying our method, 94.6% of the key points were labeled correctly, corresponding to 83.0% of the images where all 16 key points were correct. The rates for individual key points and for the different steps in our chain are depicted in Fig. 4. First, we see that the CRF improves upon the plain U-Net results, especially in terms of the number of correct cases. Second, we see that the U-Net provides few good alternative localization hypotheses, which is apparent in a bad upper bound of the CRF of just 59.7% and justifies our third step. Third, we see that the additional CRF refinement step improves upon the CRF, where the percentage of correct cases increases dramatically from 57.3% to 83.0%. Fourth, the performance slightly decreases towards the lower ribs, which is probably caused by low contrast, higher variability and fewer meaningful surrounding structures (Fig. 3c). Errors in terms of Euclidean distance to the true position as well as distance to the centerline are listed in Table 1. The resulting median values of 2.8 mm and 0.7 mm, respectively, are in line with the visualization of the prediction results depicted in Fig. 3b. The overall average runtime of our approach per case comes down to 36 ms U-Net + 61 ms CRF + 73 ms refinement = 170 ms running our unoptimized Python implementation on an Intel Xeon CPU E5-2650 in combination with an NVIDIA Titan X.

See supplement 1 for supporting content.

Fig. 4.
figure 4

Rates for correct cases (i.e., all 16 key points localized correctly) and correctly localized key points for our three steps in percent. Upper bound indicates the theoretical maximal performance of the CRF, caused by the limitation of the state space to the set of localization hypotheses. 100% corresponds to 642 cases and individual key points (L2-R9), and \(642 \cdot 16 = {10272}\) total key points.

Table 1. Median and mean Euclidean distance between true and predicted position (error) as well as median and mean distance between centerline and predicted position (line distance) for individual and all key points in mm.

4 Discussion and Conclusions

We presented a general approach for localization and labeling of spatially correlated key points and applied it to the task of rib localization. The state-of-the-art FCN U-Net has been used as localizer, which was regularized by a CRF incorporating spatial information between key points. The limitation of a reduced CRF state space in form of localization hypotheses, imposed by the exact CRF inference in large graphs, has been overcome with a novel “refine” node label. After a first CRF inference, a second inference is performed on small subgraphs formed by the marked “refine” nodes to refine the respective key points over all image pixels (rather than the set of localization hypotheses). Applying our approach to 642 images of the publicly available Indiana chest X-ray collection [8], we were able to correctly localize and label 94.6% of the 16 key points, corresponding to 83.0% fully correct cases. The introduced refinement allowed for an increase of 25.7 percentage points in fully correct cases over the global CRF alone. Note that this was achieved without domain-specific assumptions; all CRF model parameters were automatically learned from annotated training data. Our approach is thus directly applicable to other anatomical localization tasks.

In future work, we are going to increase the rotation and scaling invariance by incorporating ternary potentials over the commonly used binary ones, with tractability being the main challenge.