1 Introduction

The erector spinae muscles (ESMs) are important muscles that act in extension and rotation of the trunk. In chronic low back pain and degenerative lumbar scoliosis (DLS), changes in the size, shape and density of the cross-sectional area (CSA) of the ESMs have been reported [1, 2]. Furthermore, in chronic obstructive pulmonary disease, the CSA of the ESMs at the level of the 12th thoracic vertebra is an excellent prognostic factor [3]. However, image analysis of the ESMs is currently performed manually by clinicians, so the measurements suffer from limited inter-clinician reliability and intra-clinician reproducibility. In addition, because the ESM is relatively large and has many adjacent muscles, its extraction requires expertise and time-consuming manual work. For these reasons, current analysis remains limited to the two-dimensional (2D) CSA, and the relationship between muscle and disease has not yet been investigated using the three-dimensional (3D) region of the ESM.

Methods for automatic recognition of skeletal muscle in computed tomography (CT) images can be divided into 2D-based and 3D-based methods. Among the 2D-based methods, Wei et al. [4] proposed an atlas-based method to recognize the ESM automatically, and a method that recognizes skeletal muscle automatically using the finite element method (FEM) has also been reported [5]. We proposed a deep convolutional neural network (CNN) based method to automatically recognize the ESM in the 12th thoracic section and obtained an average Jaccard coefficient (JC) of \(82.4\%\) [6]. In the 3D-based methods, on the other hand, the goal is to obtain a 3D region of skeletal muscle. We created a computational anatomical model that imitates the running direction of the muscles and realized automatic recognition of surface muscles [7] and deep muscles [8]. Moreover, our automatic ESM recognition method using random forest achieved an average Dice coefficient (DC) of \(93.0\,{\pm }\,2.1\%\) [9]. In addition, Yokota et al. [10] realized automatic recognition of skeletal muscle in the hip and femoral region by a hierarchical multi-atlas method.

Analysis of diseases associated with the ESM [1,2,3] requires extracting a section at a specific vertebral level. Furthermore, recognizing the anatomical attachment position of the skeletal muscle is important for generating a computational anatomical model on the CT image and for using that model appropriately. Indeed, the origin and insertion of each muscle are used when creating a muscle running model [7, 8]. In addition, in recognition of skeletal muscle in the shoulder region, recognition accuracy was improved by using, in model building, the structural features of the scapula, which is the attachment site of the muscle [11]. Therefore, recognizing the anatomical attachment position of the muscle on the skeleton is necessary for constructing and using the model and for analyzing the relationship between muscle and disease, as well as for muscle recognition itself.

In this study, we use a 2D deep CNN to obtain not only muscle recognition results but also the regions of origin and insertion, that is, the attachment areas on the skeleton that are necessary for muscle analysis.

Fig. 1. A network of automatic recognition of erector spinae muscle and its attached areas on the skeleton. In three-dimensional (3D) to two-dimensional (2D) image sampling, each section is extracted from the computed tomography image as input images. In 2D to 3D label voting, recognition results of each section are integrated into a 3D image. Details of the fully convolutional network are described in Fig. 2.

2 Method

2.1 Overview

The proposed method is based on a method for automatic recognition of multiple organs in 3D CT images using a deep CNN [12]. The outline of the method is shown in Fig. 1. The input is a torso CT image, and the output is a label image of the ESM and its attachment region on the skeleton. First, 2D images of three anatomical sections are obtained from the input CT image. Each 2D cross-sectional image is then input to the deep CNN, and region recognition is performed on it. Finally, the recognition results obtained in each cross section are integrated into a 3D image using the label probabilities. A fully convolutional network (FCN) [13] is used for region recognition in the 2D sections. The FCN is trained with CT images and ground truth images in which the ESM and its attachment areas on the skeleton have been extracted.

2.2 3D to 2D Image Sampling and 2D to 3D Label Voting

In the proposed method, 2D cross-sectional images are generated from the 3D CT image and used as input images. The ESM and its attachment region are then recognized in each 2D cross section, and finally the recognition results of all cross sections are reconstructed into a 3D image. Note that each voxel of the 3D CT image belongs to several 2D cross-sectional images. Recognizing the target region in 2D images of several cross sections therefore yields several label predictions per voxel, which is intended to improve recognition accuracy. Here, 2D images of three orthogonal cross sections (axial, coronal and sagittal) are created, so each voxel always appears in exactly three 2D images and obtains three recognition results, one per section. These results are integrated into a 3D image by label voting: the final label of each voxel is the one that maximizes the product of the label probabilities from the three cross sections.
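The sampling and voting steps can be illustrated with a minimal sketch. It assumes a volume stored as nested lists indexed `vol[z][y][x]` and per-section label probabilities that are already available for a voxel; the function names (`axial_slice`, `coronal_slice`, `sagittal_slice`, `fuse_voxel`) and the toy numbers are hypothetical and are not taken from the authors' implementation.

```python
from math import prod

# The three classes named in Sect. 2.3 (order is an assumption of this sketch).
LABELS = ("background", "ESM", "attachment")


def axial_slice(vol, z):
    """2D section at a fixed z index."""
    return vol[z]


def coronal_slice(vol, y):
    """2D section at a fixed y index (rows run along z)."""
    return [plane[y] for plane in vol]


def sagittal_slice(vol, x):
    """2D section at a fixed x index (rows run along z)."""
    return [[row[x] for row in plane] for plane in vol]


def fuse_voxel(p_axial, p_coronal, p_sagittal):
    """Index of the label whose product of the three section probabilities is largest."""
    scores = [prod(ps) for ps in zip(p_axial, p_coronal, p_sagittal)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2 x 2 x 2 volume, indexed as vol[z][y][x].
vol = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
print(axial_slice(vol, 1))     # [[4, 5], [6, 7]]
print(coronal_slice(vol, 0))   # [[0, 1], [4, 5]]
print(sagittal_slice(vol, 1))  # [[1, 3], [5, 7]]

# Hypothetical probabilities for one voxel from the three sections.
label = fuse_voxel([0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1])
print(LABELS[label])           # ESM
```

In this toy voxel the axial section alone would favor the background, but the product over the three sections (0.03 for background versus 0.168 for ESM) selects the ESM label.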

Fig. 2. Fully convolutional network structure of our proposed method (K: kernel size, S: stride).

2.3 ESM and Its Attachment Region Segmentation Using FCN

In this method, an FCN is used to perform region recognition in the 2D images of each section generated from the 3D image. The FCN consists of two parts: a down-sampling part and an up-sampling part. Abstract information is first extracted in the down-sampling part, and labels are then predicted pixel by pixel in the up-sampling part. The parameters of the FCN are optimized by training.

Figure 2 shows the FCN structure used in the proposed method. The down-sampling part is based on the network structure of VGG 16 [14], which consists of thirteen \(3\,{\times }\,3\) convolution layers, five pooling layers and three fully connected layers. In the FCN, the fully connected layers of VGG 16 are replaced by convolution layers. The last \(1\,{\times }\,1\) convolution layer has as many output channels as there are labels to classify. In this method, there are three labels: the background, the ESM and its attachment region on the skeleton. The up-sampling part is composed of three deconvolution layers and two convolution layers. The network has a skip structure that feeds information that would otherwise be lost in the VGG 16 convolution layers into the deconvolution layers. The network with one deconvolution layer is called FCN-32s; training is then repeated with an additional deconvolution layer each time to construct FCN-16s and FCN-8s. In this method, the output of FCN-8s is taken as the recognition result of the 2D image. A rectified linear unit (ReLU) is used as the activation function.
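The naming FCN-32s, FCN-16s and FCN-8s refers to the overall stride at which predictions are made before the final up-sampling. The following sketch traces only the spatial sizes. It assumes the standard FCN-8s layout of the FCN paper [13] (skip connections from the fourth and third pooling stages, deconvolution factors of 2, 2 and 8), an input size divisible by 32, and no padding or cropping; the exact layout of Fig. 2 may differ, and the function name `fcn8s_sizes` is hypothetical.

```python
def fcn8s_sizes(input_size, num_pools=5):
    """Trace the side length of the feature maps through pooling and up-sampling.

    Assumes input_size is divisible by 2 ** num_pools and ignores the
    padding and cropping that real implementations need.
    """
    if input_size % (2 ** num_pools) != 0:
        raise ValueError("input_size must be divisible by 2 ** num_pools")

    sizes = {"input": input_size}
    size = input_size
    for i in range(1, num_pools + 1):
        size //= 2  # each 2 x 2 pooling with stride 2 halves the side length
        sizes[f"pool{i}"] = size

    sizes["up1"] = sizes["pool5"] * 2  # first deconvolution (x2), fused with pool4
    sizes["up2"] = sizes["up1"] * 2    # second deconvolution (x2), fused with pool3
    sizes["output"] = sizes["up2"] * 8 # third deconvolution (x8), back to input size

    # The skip connections only work if the fused maps have equal size.
    assert sizes["up1"] == sizes["pool4"]
    assert sizes["up2"] == sizes["pool3"]
    return sizes


sizes = fcn8s_sizes(512)
print(sizes["pool5"], sizes["up1"], sizes["up2"], sizes["output"])  # 16 32 64 512
```

For a \(512\,{\times }\,512\) axial section, the coarsest feature map is \(16\,{\times }\,16\) (stride 32), and the two skip fusions refine it at stride 16 and stride 8 before the final up-sampling.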

2.4 Input Label Image

The network is trained with the original images and the corresponding ground truth images, which were segmented manually. An example of the ground truth image is shown in Fig. 3. Figure 3(a) shows the whole ESM in a 3D representation. The muscles are present as a pair, one on each side of the body. The middle diagram shows the attachment area on the skeleton. Here, on the dorsal side of the ribs and on the transverse processes of the thoracic vertebrae, the area of the skeleton that is in contact with the muscle is defined as the ground truth. This corresponds to the origin and insertion of the iliocostalis muscle and the longissimus muscle among the muscles constituting the ESM. In the training process, the ESM and its attachment region on the skeleton are learned simultaneously. Figure 3(b) shows a cross section in which the ground truth is overlaid on the original CT image.

3 Experiment

The CT images used in this study are non-contrast torso CT images acquired with a Light Speed Ultra 16 scanner (General Electric) at Gifu University Hospital, Japan. All the data have an isotropic voxel resolution of 0.625 mm. The size of the data ranges from \(512\,{\times }\,512\,{\times }\,802\) voxels to \(512\,{\times }\,512\,{\times }\,1031\) voxels. Eleven cases were used for the experiment and evaluated by the leave-one-out method. For training, the VGG 16 model trained on the ImageNet ILSVRC-2014 data set [14] was used as the pre-trained model. The DC, JC, recall rate and precision rate are used to evaluate the recognition results of the ESM and its attachment region on the skeleton.
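The four evaluation measures can be computed from the voxel counts of true positives, false positives and false negatives. The sketch below assumes the standard definitions of these measures and binary masks given as flat sequences of 0 and 1 with at least one foreground voxel in each; the function name `overlap_metrics` and the toy masks are hypothetical.

```python
def overlap_metrics(truth, pred):
    """Return JC, DC, recall (RC) and precision (PR) for two binary masks.

    truth, pred: equally long flat sequences of 0/1 values.
    Assumes both masks contain at least one foreground voxel.
    """
    if len(truth) != len(pred):
        raise ValueError("masks must have the same number of voxels")

    tp = sum(1 for t, p in zip(truth, pred) if t and p)      # in both masks
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)  # predicted only
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)  # ground truth only

    return {
        "JC": tp / (tp + fp + fn),          # intersection over union
        "DC": 2 * tp / (2 * tp + fp + fn),  # twice the overlap over the summed sizes
        "RC": tp / (tp + fn),               # fraction of the ground truth recovered
        "PR": tp / (tp + fp),               # fraction of the prediction that is correct
    }


# Toy example: 3 ground truth voxels, 3 predicted voxels, 2 in common.
truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(overlap_metrics(truth, pred))
```

For the toy masks, the overlap is 2 voxels and the union is 4 voxels, so the JC is 0.5 while the DC, recall and precision are each 2/3.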

The method was implemented with the Caffe framework and run on an NVIDIA GeForce TITAN X GPU with 12 GB of memory.

Fig. 3. Ground truth image. (a) Erector spinae muscle (green) and skeleton (gray). (b) Muscle attachment region on the skeleton (red). (c) Erector spinae muscle (green) and its attachment region on the skeleton (red). (Color figure online)

4 Results

The recognition results of the ESMs in the 11 cases are shown in Table 1. The average JC of the ESM recognition results was \(81.7\,{\pm }\,3.2\%\), and the average DC was \(89.9\,{\pm }\,2.0\%\). In the section at the twelfth thoracic vertebra, the average JC of the ESM recognition results was \(85.6\,{\pm }\,3.7\%\), and the average DC was \(92.2\,{\pm }\,2.2\%\). Table 2 shows the recognition results of the attachment region on the skeleton, for which the average JC was \(48.8\,{\pm }\,3.7\%\) and the average DC was \(65.5\,{\pm }\,3.3\%\). Figure 4 shows an example of the recognition result in 2D, and Fig. 5 shows the recognition result in 3D.
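The JC and DC values above can be cross-checked against each other, because the two coefficients are related by a fixed identity for any single pair of regions. The worked equations below assume the standard definitions, with \(A\) the ground truth region and \(B\) the recognized region (symbols introduced here for illustration). Because the identity is nonlinear, it holds exactly per case and only approximately for averages over cases.

```latex
\[
\mathrm{JC} = \frac{|A \cap B|}{|A \cup B|}, \qquad
\mathrm{DC} = \frac{2\,|A \cap B|}{|A| + |B|}
\quad\Longrightarrow\quad
\mathrm{DC} = \frac{2\,\mathrm{JC}}{1 + \mathrm{JC}}
\]
\[
\frac{2 \times 0.817}{1.817} \approx 0.899, \qquad
\frac{2 \times 0.856}{1.856} \approx 0.922, \qquad
\frac{2 \times 0.488}{1.488} \approx 0.656
\]
```

These values agree closely with the reported average DCs of \(89.9\%\), \(92.2\%\) and \(65.5\%\). The identity also shows why the DC of the attachment region looks considerably higher than its JC: for small overlaps the DC approaches twice the JC.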

Fig. 4. Example of the recognition result in two-dimensional cross sections. (a) Original computed tomography images. (b) Ground truth images. (c) Recognition results.

Fig. 5. (a) Ground truth. (b) Recognition result.

Table 1. Recognition result of the erector spinae muscles (JC: Jaccard coefficient, DC: Dice coefficient, RC: recall rate, PR: precision rate).

5 Discussion

Automatic recognition of the ESM using the 2D deep CNN achieved an average DC of \(89.9\,{\pm }\,2.0\%\). This accuracy is slightly lower than the \(93.0\,{\pm }\,2.1\%\) achieved by our random forest based ESM recognition method [9]. Although both methods used the same training dataset, we attribute the lower accuracy to the fact that a deep CNN requires more training cases than conventional machine learning methods. On the other hand, the average JC of this method in the section at the 12th thoracic vertebra was \(85.6\,{\pm }\,3.7\%\). This is higher than the average JC of \(82.4\%\) obtained in our previous study on automatic recognition of the ESM in the 12th thoracic section using a deep CNN [6]. In this study, the network learns not only from axial sections but also from sagittal and coronal sections, which suggests that training with coronal and sagittal sections as well is effective for a large skeletal muscle such as the ESM. Although the numerical accuracy for the muscle attachment region is low, the origin and insertion regions are recognized well, as shown in Figs. 4 and 5. The anatomical attachment site of skeletal muscle is one of the essential elements for orthopedic intervention, and its recognition is as important as recognition of the skeletal muscle region itself.

Table 2. Recognition results of the erector spinae muscle attachment region on the skeleton (JC: Jaccard coefficient, DC: Dice coefficient, RC: recall rate, PR: precision rate).

As a next step, it is necessary to conduct a large-scale experiment with an increased number of cases and to verify the ESM recognition accuracy of the deep CNN. However, it is not easy to create many ground truth images of a large and complex skeletal muscle such as the ESM. Therefore, training images for the deep CNN need to be generated efficiently, for example by using our fast and accurate random forest based method [9].

6 Conclusion

In this study, automatic recognition of the ESMs and their attachment region on the skeleton in torso CT images was performed using a deep CNN. In a leave-one-out cross validation test using eleven cases, the average Dice coefficient of the ESM was \(89.9\,{\pm }\,2.0\%\). In the section at the 12th thoracic vertebra, the average Jaccard coefficient was \(85.6\,{\pm }\,3.7\%\). This result shows that automatic recognition is achieved with a high coincidence ratio in a clinically important 2D cross section, and it makes quantitative analysis in 3D possible. Although the numerical recognition accuracy of the attachment region was low, simultaneous automatic recognition of the skeletal muscle and its anatomical attachment sites (origin and insertion) was realized. In future work, we aim to clarify the relationship between the 3D ESM and disease using the recognized muscle region and its attachment position on the skeleton.