1 Introduction

High-energy pelvic fractures, usually caused by motor vehicle collisions, falls from height, or crush injury, are the second leading cause of death from acute physical trauma after brain injury. The overall mortality rate of pelvic fractures ranges from \(5\%\) to \(15\%\), rising to \(36\%{-}54\%\) in patients with hemorrhagic shock [12]. With the widespread availability of CT in trauma bays, the majority of patients with severe pelvic trauma admitted to level I trauma centers currently undergo contrast-enhanced trauma CT, in part to assess for foci of active bleeding, which manifest as contrast extravasation [3]. The size of foci of contrast extravasation from bleeding vessels correlates with the need for blood transfusion, angiographic or surgical hemostatic intervention, and mortality, but reliable measurements of contrast extravasation volume cannot be derived at the point of care using manual, semi-automated, or shorthand diameter-based methods. Fully automated methods are therefore necessary for real-time point-of-care decision making, treatment planning, and prognostication (Fig. 1).

In this paper, we focus on volumetric segmentation of foci of active bleeding (i.e., contrast extravasation) after pelvic fractures. This task is of vital importance yet challenging for the following reasons: (1) hemorrhage gray levels vary from patient to patient, depending on a variety of factors (e.g., the rate of bleeding, the timing of the scan, and the patient's physiologic state after trauma); (2) hemorrhage boundaries are often poorly defined and highly irregular; and (3) intensity levels are inconsistent throughout a hemorrhagic focus. Prior works have relied on semi-automated threshold- or region-growing-based methods in post-processing software [5]. However, these techniques are too time-consuming for clinical use in the trauma radiology setting. To overcome this difficulty, a previous method [4] first exploited spatial contextual information from artery and bone to detect the hemorrhage, and then applied a rule-based strategy to refine the segmentation results. This heuristic approach requires multiple stages that cannot be efficiently optimized end-to-end. Moreover, it cannot properly handle other challenges such as variation in target size and ambiguous boundaries.

Fig. 1. Visual examples of pelvic CT scans in axial/coronal/sagittal views. The red contour denotes the boundary of the active hemorrhage; note the large variations in shape and texture. (Color figure online)

Recently, the emergence of deep learning has largely advanced the field of computer-aided diagnosis (CAD). Riding on the success of convolutional neural networks, e.g., fully convolutional networks [9], researchers have achieved accurate segmentation on many medical image analysis tasks [10, 11, 15, 16]. Existing coarse-to-fine methods [14, 15], which refine segmentation results through explicit cropping of a single region of interest (ROI), are better suited to single connected structures such as the pancreas or liver, whereas sites of active bleeding are frequently discontinuous, multi-focal, and located in widely disparate vascular territories. Herein, we present a multi-scale attentional network (MSAN), the first reliable framework for segmenting active bleed after pelvic fractures. Specifically, our framework (1) fully exploits contextual information from holistic 2D slices using an encoder that extracts global contextual information across different levels of image features; (2) efficiently handles the variation of active hemorrhage sizes by adopting multi-scale strategies during both the training and testing phases; (3) deals with ambiguous boundaries by utilizing an attentional mechanism to enhance the discrimination between trauma and non-trauma regions; and (4) aggregates multiple views (i.e., coronal, sagittal, and axial) to further leverage 3D information. To assess the effectiveness of our framework, we collected a dataset of 65 patients with pelvic fractures and active hemorrhage of widely varying severity. For each case, every pixel/voxel of active hemorrhage was manually labeled by an experienced radiologist. Unlike the previously described heuristic method, which used crude and not widely adopted measures of accuracy such as missegmented area [4], we employ the Dice-Sørensen coefficient (DSC) for evaluation based on pixel/voxel-wise predictions. Experimental results demonstrate the superiority of our framework compared with a series of 2D/3D state-of-the-art deep learning algorithms.

2 Multi-scale Attentional Network

2.1 Overall Framework

We denote a 3D CT volume as \(\mathbf {X}\) with size \(W\times H\times L\), where each element of \(\mathbf {X}\) indicates the Hounsfield unit (HU) of a voxel. The corresponding binary ground-truth segmentation mask is denoted as \(\mathbf {Y}\), where \({y_i}={1}\) indicates a foreground voxel. Given a segmentation model \(M:{\mathbf {Z}}={\mathbf {f}\,\!\left( \mathbf {X};\varTheta \right) }\) parameterized by \(\varTheta \), our goal is to predict a binary output volume \(\mathbf {Z}\) of the same dimension as \(\mathbf {X}\). We denote \(\mathcal {Y}\) and \(\mathcal {Z}\) as the sets of foreground voxels in the ground truth and the prediction, i.e., \({\mathcal {Y}}={\left\{ i\mid y_i=1\right\} }\) and \({\mathcal {Z}}={\left\{ i\mid z_i=1\right\} }\). Segmentation accuracy is evaluated by the Dice-Sørensen coefficient (DSC): \({\mathrm {DSC}\,\!\left( \mathcal {Y},\mathcal {Z}\right) }= {\frac{2\times \left| \mathcal {Y}\cap \mathcal {Z}\right| }{\left| \mathcal {Y}\right| +\left| \mathcal {Z}\right| }}\). This metric lies in the range \(\left[ 0,1\right] \), and \(\mathrm{DSC}=1\) implies a perfect segmentation.
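
As a concrete reference, the DSC between a prediction and a ground-truth mask can be computed as in the minimal NumPy sketch below; the function name and the convention of returning 1.0 when both masks are empty are our own choices, not details from the released code.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice-Sørensen coefficient between two binary volumes of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as a perfect match (a convention)
    return 2.0 * intersection / denom

# Example: Z and Y are W x H x L arrays of {0, 1}
# dsc = dice_coefficient(Z, Y)
```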

Following [11, 14, 15], three sets of 2D slices, i.e., \(\mathbf {X}_{\mathrm {C},w}\) (\({w}={1,2,\ldots ,W}\)), \(\mathbf {X}_{\mathrm {S},h}\) (\({h}={1,2,\ldots ,H}\)) and \(\mathbf {X}_{\mathrm {A},l}\) (\({l}={1,2,\ldots ,L}\)), are obtained along the three axes. The subscripts \(\mathrm {C}\), \(\mathrm {S}\) and \(\mathrm {A}\) stand for “coronal”, “sagittal” and “axial”, respectively. We train an individual model \(M\) for each of the three viewpoints. Without loss of generality, we consider a 2D slice along the axial view, denoted by \(\mathbf {X}_{\mathrm {A},l}\). Our goal is to infer a binary segmentation mask \(\mathbf {Z}_{\mathrm {A},l}\) of the same dimensionality. In the context of deep networks [1, 9], this is achieved by computing a probability map \({\mathbf {P}_{\mathrm {A},l}}={\mathbf {f}\,\!\left[ \mathbf {X}_{\mathrm {A},l};\theta \right] }\), where \(\mathbf {f}\,\!\left[ \cdot ;\theta \right] \) is the architecture shown in Fig. 2(a). This network contains an encoder (Sect. 2.2) to extract different levels of features for distilling global context and an attentional module (Sect. 2.3) for further refinement.
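
For concreteness, the following sketch extracts the three sets of slices from a volume stored as a NumPy array. The mapping between array axes and anatomical views is an assumption (it depends on how the CT volume is oriented on disk); here axis 0/1/2 follow the notation \(\mathbf {X}_{\mathrm {C},w}\), \(\mathbf {X}_{\mathrm {S},h}\), \(\mathbf {X}_{\mathrm {A},l}\) above.

```python
import numpy as np

def slices_along_views(X: np.ndarray):
    """Split a W x H x L volume into coronal, sagittal and axial 2D slices."""
    coronal  = [X[w, :, :] for w in range(X.shape[0])]  # X_{C,w}, each H x L
    sagittal = [X[:, h, :] for h in range(X.shape[1])]  # X_{S,h}, each W x L
    axial    = [X[:, :, l] for l in range(X.shape[2])]  # X_{A,l}, each W x H
    return coronal, sagittal, axial
```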

Specifically, we apply Atrous Spatial Pyramid Pooling (ASPP) [1] at the end of the backbone model to extract high-level features with enriched global context. Meanwhile, the low-level features extracted from earlier layers, which contain local information, are fed to an attentional module to distill more useful information. The refined low-level features are then concatenated with the high-level features extracted by ASPP and passed to the final classifier layer, which outputs probabilities \(\mathbf {P}_{\mathrm {C},w}\), \(\mathbf {P}_{\mathrm {S},h}\) and \(\mathbf {P}_{\mathrm {A},l}\); these are binarized into \(\mathbf {Z}_{\mathrm {C},w}\), \(\mathbf {Z}_{\mathrm {S},h}\) and \(\mathbf {Z}_{\mathrm {A},l}\), respectively. The final segmentation is fused from the three views via majority voting [14, 15]. Multi-scale processing [1, 8] is used in both the training and inference stages to further enhance segmentation accuracy, especially for small targets. As illustrated in Fig. 2, different rescaled versions of the original image are fed to the network during training. During testing, the outputs from different scales are fused by averaging the responses at each position; if the average probability exceeds a threshold \(\rho \), the position is regarded as foreground, otherwise as background. A sketch of this fusion is given below.
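
The following is a minimal sketch of the testing-stage fusion, assuming the per-scale probability maps have already been resampled back to the original resolution. The threshold value of 0.5 and the two-of-three voting rule are illustrative assumptions; the text above only specifies a threshold \(\rho \) and majority voting [14, 15].

```python
import numpy as np

def fuse_scales(prob_maps, rho=0.5):
    """Average per-scale probability maps and binarize at threshold rho."""
    avg = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (avg > rho).astype(np.uint8)

def fuse_views(z_coronal, z_sagittal, z_axial):
    """Majority voting across the three per-view binary volumes."""
    votes = (z_coronal.astype(np.int32) + z_sagittal.astype(np.int32)
             + z_axial.astype(np.int32))
    return (votes >= 2).astype(np.uint8)
```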

2.2 Encoder Backbone Architecture

Atrous Convolution has been widely applied in computer vision, as it efficiently enlarges the receptive field by controlling the atrous rate. Given an input feature map \(x\), atrous convolution is applied over \(x\) as follows:

$$\begin{aligned} y[i] = \sum _{k} x[i + r \cdot k] w[k], \end{aligned}$$
(1)

where \(i\) and \(w\) denote the spatial location and the convolution filter, respectively, \(k\) indexes the filter positions, and \(r\) is the atrous rate. A minimal sketch is given below.
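
To make Eq. (1) concrete, the sketch below implements it directly in one dimension and shows the TensorFlow layer that realizes the same operation in 2D via its dilation rate; the filter count of 256 is a placeholder, not a value stated above.

```python
import numpy as np
import tensorflow as tf

def atrous_conv_1d(x, w, r):
    """Direct 1-D implementation of Eq. (1): y[i] = sum_k x[i + r*k] * w[k]."""
    out_len = len(x) - r * (len(w) - 1)
    return np.array([sum(x[i + r * k] * w[k] for k in range(len(w)))
                     for i in range(out_len)])

# In 2D, the same operation is realized by a standard convolution layer
# with a dilation (atrous) rate.
atrous_layer = tf.keras.layers.Conv2D(filters=256, kernel_size=3,
                                      dilation_rate=12, padding='same')
```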

Atrous Spatial Pyramid Pooling (ASPP) originates from Spatial Pyramid Pooling [7]. The main difference is that ASPP uses atrous convolution, which allows a larger field-of-view during training and can thus efficiently integrate global contextual information. As a strong contextual aggregation module [1], ASPP is applied (see Fig. 2(a)) so that the contextual information from artery and bone can be better exploited. In our experiments, we set the atrous rates to \(\{12, 24, 36\}\). A sketch of such an ASPP head follows.
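
Below is a minimal Keras-style sketch of an ASPP head with the atrous rates used here. The 1×1 branch, the image-level pooling branch, and the 256-channel width follow common DeepLab-style conventions [1] and are assumptions rather than details stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(features, rates=(12, 24, 36), channels=256):
    """Atrous Spatial Pyramid Pooling over a backbone feature map (NHWC)."""
    branches = [layers.Conv2D(channels, 1, padding='same',
                              activation='relu')(features)]
    for r in rates:
        branches.append(layers.Conv2D(channels, 3, dilation_rate=r,
                                      padding='same', activation='relu')(features))
    # Image-level context branch: global average pooling, then broadcast back.
    pooled = tf.reduce_mean(features, axis=[1, 2], keepdims=True)
    pooled = layers.Conv2D(channels, 1, activation='relu')(pooled)
    pooled = tf.image.resize(pooled, tf.shape(features)[1:3], method='nearest')
    branches.append(pooled)
    fused = tf.concat(branches, axis=-1)
    return layers.Conv2D(channels, 1, padding='same', activation='relu')(fused)
```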

Fig. 2. (a) The network architecture of MSAN. Low-level features are refined by an attentional module, while ASPP is applied at the end of the backbone model to extract high-level features with enriched global context. (b) Our implementation of the attentional module, using non-local means [13] as the main operation.

2.3 Attentional Module

We adapt the non-local block [13] as the attentional module in our framework. Specifically, it first computes an attention map \(y\) of an input feature map \(x\) by taking a weighted average of the features at all spatial locations \(\mathcal {L}\):

$$\begin{aligned} y_i = \frac{1}{\mathcal {C}(x)} \sum _{\forall j \in \mathcal {L}} f(x_i, x_j)\cdot x_j, \end{aligned}$$
(2)

where \(i\) and \(j\) are spatial indices. A pairwise function \(f(x_i, x_j)\) computes the spatial attention coefficient between each \(i\) and all \(j\). These coefficients weight the input features so as to prune out irrelevant background features and thereby highlight salient image regions. \(\mathcal {C}(x)\) is a normalization function. We use the dot-product version in [13], setting \(f(x_i, x_j) = x_i^\text {T} x_j\) and \({\mathcal {C}(x)}={N}\), where \(N\) is the number of pixels in \(x\).

Following [13], the attention map \(y\) is then processed by a \(1\times 1\) convolutional layer and added to the input feature map \(x\) to obtain the final output \(z\), i.e., \(z = wy + x\), where \(w\) is the weight of the convolutional layer. An illustration of our attentional module can be found in Fig. 2(b), and a condensed sketch is given below.
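
The sketch below implements the module as described above: dot-product affinity normalized by \(N\), a \(1\times 1\) convolution, and the residual connection. Statically known NHWC shapes and the inline layer creation are simplifying assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def nonlocal_block(x):
    """Dot-product non-local block: y_i = (1/N) sum_j (x_i^T x_j) x_j, then z = w*y + x."""
    _, H, W_, C = x.shape                                 # assumes static NHWC shape
    N = H * W_
    flat = tf.reshape(x, [-1, N, C])                      # (B, N, C)
    affinity = tf.matmul(flat, flat, transpose_b=True)    # (B, N, N): f(x_i, x_j) = x_i^T x_j
    y = tf.matmul(affinity, flat) / tf.cast(N, x.dtype)   # weighted average over all positions
    y = tf.reshape(y, [-1, H, W_, C])
    y = layers.Conv2D(C, kernel_size=1)(y)                # the 1x1 convolution (weight w)
    return x + y                                          # residual connection to the input
```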

3 Experiments

3.1 Dataset and Evaluation

We collected 65 studies that were routinely acquired with 64-section or higher MDCT scanners in the trauma bay, in either the late arterial or portal venous phase of enhancement. We use 45 cases for training and evaluate segmentation performance on the remaining 20 cases. Note that [4] was evaluated on only 12 cases, which, to the best of our knowledge, constituted the first and only curated dataset with manual ground-truth label masks; our dataset can therefore be considered a valid set for evaluation. The metric we use is DSC, which measures the similarity between the prediction voxel set \(\mathcal {Z}\) and the ground-truth set \(\mathcal {Y}\), with the mathematical form \({\mathrm {DSC}\,\!\left( \mathcal {Z},\mathcal {Y}\right) }= {\frac{2\times \left| \mathcal {Z}\cap \mathcal {Y}\right| }{\left| \mathcal {Z}\right| +\left| \mathcal {Y}\right| }}\).

3.2 Implementation Details

Our implementations are based on TensorFlow. We used two standard architectures, i.e., ResNet-50 and ResNet-101 [6], as backbone models. All segmentation experiments were performed on whole pelvic CT scans and run on a Tesla V100 GPU. For data pre-processing, following [11], we simply truncated the raw intensity values to the range \([-80, 320]\) HU and then normalized each raw CT case to \([0, 255]\). Random rotation of \([0, 15]\) degrees is used as online data augmentation. A poly learning-rate policy is applied with an initial learning rate of 0.05 and a decay power of 0.9. We follow [11, 14, 15] in using an ImageNet-pretrained model for initialization. Sketches of the pre-processing and the learning-rate schedule are given below.
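
The following sketch illustrates the HU truncation/normalization and the poly learning-rate policy described above; the total number of training steps and the SGD momentum in the usage comment are placeholders not specified in the text.

```python
import numpy as np
import tensorflow as tf

def preprocess_ct(volume_hu: np.ndarray) -> np.ndarray:
    """Truncate HU values to [-80, 320] and linearly rescale to [0, 255]."""
    clipped = np.clip(volume_hu.astype(np.float32), -80.0, 320.0)
    return (clipped + 80.0) / 400.0 * 255.0

class PolyDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Poly policy: lr = base_lr * (1 - step / max_steps) ** power."""
    def __init__(self, base_lr=0.05, max_steps=30000, power=0.9):
        super().__init__()
        self.base_lr, self.max_steps, self.power = base_lr, max_steps, power

    def __call__(self, step):
        frac = tf.cast(step, tf.float32) / float(self.max_steps)
        return self.base_lr * tf.pow(1.0 - frac, self.power)

# Usage (momentum value is an assumption):
# optimizer = tf.keras.optimizers.SGD(learning_rate=PolyDecay(), momentum=0.9)
```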

Table 1. DSC comparison of active bleed segmentation. ResNet101-MSAN-3-scale achieves the best performance of 59.89%, surpassing the prior art by more than \(7\%\).

3.3 Results and Discussions

All results are summarized in Table 1, where we list thorough comparisons under different configurations of network architecture (i.e., ResNet50 and ResNet101 [6]) and scales (i.e., \(scales=\{1.0, 1.25, 1.5, 1.75\}\)). Note that we use larger scales (\({\ge }1.0\)) since our goal is to segment small targets. Under different settings, our method consistently outperforms the others, indicating the effectiveness of MSAN.

Efficacy of Multi-scale Processing. As shown in Table 1, larger scales generally lead to better results. For instance, using ResNet50 as the backbone model, the performance under \(scale=1.0\) is \({\sim }10\%\) lower than that under other larger scales. ResNet101-single-scale yields the best result of \(54.56\%\) under \(scale=1.75\), which is more than \(17\%\) better than using the scale of 1.0. These facts all indicate the efficacy of utilizing larger scales. Another observation is that the integration of more scales also leads to better segmentation quality than using just one scale. Using either ResNet50 or ResNet101 as the backbone, 3-scales always yield better results than 2-scales/single-scale, which shows that the learned knowledge from these different scales is complementary to each other. Therefore combining the information from these different scales can be beneficial for handling targets with a large variety of sizes, such as active bleed in our study.

Fig. 3. Qualitative comparison of different methods. From left to right: original CT image, predictions of the single-scale method (\(scale=1.50\)), the multi-scale method (\(scales=\{1.25, 1.50, 1.75\}\)), MSAN, and the manual label. (Best viewed in color)

Efficacy of the Attentional Module. Meanwhile, we also observe an additional benefit from the attentional module. For instance, ResNet101-MSAN-3-scale yields an improvement of \(1.17\%\) over ResNet101-3-scale, and ResNet101-MSAN-2-scale yields an improvement of \(0.72\%\) over ResNet101-2-scale. A similar improvement can also be witnessed for ResNet-50. Three qualitative examples are shown in Fig. 3, where MSAN consistently outperforms the other existing methods. For case 027, our MSAN successfully removes the outlier (indicated by the orange arrows) that is detected as a false positive by the other methods. This further demonstrates that the attentional mechanism can indeed refine the results and suppress non-trauma outliers.

Overall, our proposed MSAN achieves a significant performance gain under different settings, which shows the generality and soundness of our approach. Additionally, we compare our method with other state-of-the-art 3D segmentation methods, including [14, 15] and [2]. Our method outperforms all of these methods significantly (all p-values \(< 0.0001\)), which further demonstrates the effectiveness of our approach. To further validate the generality and stability of MSAN, we directly test on 15 newly collected additional cases without any retraining. Our method obtains an average DSC of \(50.19\%\), whereas prior methods report \(44.15\%\) [14], \(35.14\%\) [15] and \(27.32\%\) [2]; MSAN significantly outperforms all of them.

4 Conclusions

In this paper, we present the Multi-Scale Attentional Network (MSAN), an end-to-end framework for automated segmentation of active hemorrhage from pelvic CT scans. Our proposed MSAN substantially improves segmentation accuracy, by more than \(7\%\) compared with prior art. We note that this framework can be practical for assisting radiologists in clinical applications, since manual annotation of 3D volumes requires substantial labor from radiologists.