
1 Introduction

Landmarks are commonly used to represent anatomical features in medical imaging. Clinicians use landmarks to derive measurements (e.g., width, length, and size) of organs for diagnosis, while radiologists and scientists register two images using corresponding sets of landmarks for further analyses. Ultrasound imaging is a widely used clinical procedure because it is safe, cost-effective, and non-invasive; landmarks in a given imaging plane are used to provide diagnostic references. In cardiac ultrasound scans, landmarks are typically defined to measure widths at the intersections between heart chambers, for example, at the annulus points of the mitral valve; in obstetric ultrasound scans, landmarks at the anterior and posterior ends of the fetal head are considered important. Manual localization of landmark points, however, is tedious and time-consuming. On an ultrasound machine, the user must adjust the caliper to the desired location with a trackball, which makes the task even more cumbersome. Furthermore, the reliability of the measurements can suffer from subjective disagreement across users. Automating landmark detection can substantially reduce manual effort and make the clinical procedure more efficient; however, it is a very challenging task given (1) noisy signals, (2) low contrast, and (3) variations in shape, orientation, and respiration phase across ultrasound images (Fig. 1).

Fig. 1. Examples of cardiac and obstetric ultrasound scans. The magenta and yellow dots indicate the first and second landmarks, respectively.

The landmark detection problem has been studied using machine learning algorithms with reasonable outcomes. A bootstrapped binary classifier, e.g., a probabilistic boosting-tree (PBT [1]), can be trained to distinguish landmark from non-landmark locations [2]; this approach can be biased by the highly unbalanced positive and negative samples. Alternatively, landmark locations can be learned in a regression manner by aggregating pixel-wise relative distances to the landmark [3]; this provides more robustness, but less precision, than the classification-based approach due to the complexity and variation of the image context. Recently, deep learning technologies have been adapted to medical imaging problems and have demonstrated promising performance by leveraging features trained with convolutional neural networks, as opposed to the hand-crafted features used in traditional machine learning approaches [4, 5]. For landmark detection, a deep reinforcement learning (DRL) approach has been shown to successfully detect annulus points in cardiac ultrasound images [6]. The DRL algorithm designs an artificial agent that searches for and learns the optimal path from any location toward the target by maximizing an action-value function. Its greedy search strategy allows the agent to traverse only a subset of the almost infinite paths across the image instead of scanning exhaustively; however, this may lead to major failures if the training data do not cover adequate variation.

Here, we propose a new landmark detection approach inspired by DRL, with the motivation of covering the entire search space. We observe that the optimal path can be broken down into optimal action steps at every pixel, and that the pixel-wise optimal action steps can be derived from the landmark location based on Euclidean distances to generate an action map. We can therefore train a supervised action classifier (SAC) that explicitly learns the action steps across the whole image, instead of learning actions implicitly along the search path as in DRL. Generating an action map effectively translates landmark detection into an image partitioning problem, which avoids the highly unbalanced positive/negative sampling of PBT. It also enables us to leverage a fully convolutional image-to-image neural network to train the SAC for estimating the action map. Furthermore, we design a robust aggregative approach to derive the landmark location from the estimated action map (Fig. 2), where our action-based aggregation is more precise than distance-based aggregation. To the best of our knowledge, we are the first to address landmark detection as image partitioning. In this paper, we apply the proposed approach to a cardiac and an obstetric ultrasound dataset for landmark detection and compare the results with other learning-based methods.

Fig. 2. Workflow of the proposed SAC approach. Action maps are generated from the given landmark locations and serve as ground truth for training a DI2IN. During testing, the trained DI2IN is applied to an unseen image to estimate an action map, from which the predicted landmark location is aggregated.

2 Theory

2.1 Landmark Representation Based on Action Map

For the purpose of landmark detection, a landmark can be represented by an action map that encodes the pixel-wise optimal action step toward the landmark. Consider an optimal action path from any location \( (x, y) \) toward a landmark \( t \) at \( (x_{t}, y_{t}) \), composed of optimal action steps at the pixels along the path on an image \( I \). At each pixel, we define unit movements \( d_{x}^{(a)} \in \{-1, 0, 1\} \) and \( d_{y}^{(a)} \in \{-1, 0, 1\} \). With the constraint \( |d_{x}^{(a)}| + |d_{y}^{(a)}| = 1 \), we allow four possible action types \( a \in \{0, 1, 2, 3\} \): up \( (d_{x}^{(0)} = 0, d_{y}^{(0)} = -1) \), right \( (d_{x}^{(1)} = 1, d_{y}^{(1)} = 0) \), down \( (d_{x}^{(2)} = 0, d_{y}^{(2)} = 1) \), and left \( (d_{x}^{(3)} = -1, d_{y}^{(3)} = 0) \). The optimal action step \( \hat{a} \) is the one that minimizes the Euclidean distance to landmark \( t \) after its associated movement,

$$ \hat{a} = \mathop{\mathrm{argmin}}\nolimits_{a} \sqrt{\left(x - x_{t} + d_{x}^{(a)}\right)^{2} + \left(y - y_{t} + d_{y}^{(a)}\right)^{2}} $$
(1)

Expanding the squares gives \( (x - x_{t})^{2} + (y - y_{t})^{2} + 2(x - x_{t})d_{x}^{(a)} + 2(y - y_{t})d_{y}^{(a)} + 1 \), since exactly one of \( d_{x}^{(a)} \) and \( d_{y}^{(a)} \) is \( \pm 1 \). After cancelling the common term \( (x - x_{t})^{2} + (y - y_{t})^{2} + 1 \) and the factor of 2, \( \hat{a} \) is seen to depend only on the pixel location \( (x, y) \) relative to the landmark:

$$ \hat{a} = \mathop{\mathrm{argmin}}\nolimits_{a} \left[ \left(x - x_{t}\right) d_{x}^{(a)} + \left(y - y_{t}\right) d_{y}^{(a)} \right] $$
(2)

Substituting the actual values of \( d_{x}^{(a)} \) and \( d_{y}^{(a)} \), the selection of \( \hat{a} \) partitions the image into four regions (one per action type), separated by the two lines with slopes ±1 crossing the landmark (see the top panel in Fig. 2), i.e., \( y = x + (y_{t} - x_{t}) \) and \( y = -x + (x_{t} + y_{t}) \). This yields an action map representing the pixel-wise optimal action step toward the target landmark location. For example, if one starts searching for the landmark at a random location, say in the red region shown in Fig. 2, the optimal actions keep moving up until hitting a partition line and then follow the line to the target landmark. With this action map representation, landmark detection is essentially converted into an image partitioning problem, as the sketch below illustrates.
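To make the construction concrete, the following minimal NumPy sketch evaluates Eq. (2) at every pixel to build the ground-truth action map for one landmark; it assumes image coordinates with y increasing downward, and `make_action_map` is a hypothetical helper name of our own, not from the paper.

```python
import numpy as np

# Sketch of the action map construction in Eq. (2). Action codes follow
# the paper: 0 = up, 1 = right, 2 = down, 3 = left.
def make_action_map(height, width, x_t, y_t):
    # Unit movements (d_x, d_y) for the four action types.
    moves = np.array([[0, -1],   # 0: up
                      [1,  0],   # 1: right
                      [0,  1],   # 2: down
                      [-1, 0]])  # 3: left
    ys, xs = np.mgrid[0:height, 0:width]
    # Eq. (2): score of action a at (x, y) is (x - x_t) d_x + (y - y_t) d_y;
    # broadcasting yields an (H, W, 4) array of scores.
    score = (xs[..., None] - x_t) * moves[:, 0] + (ys[..., None] - y_t) * moves[:, 1]
    # The optimal action minimizes the score; ties on the partition
    # lines are broken by action index.
    return np.argmin(score, axis=-1).astype(np.uint8)

action_map = make_action_map(480, 480, x_t=200, y_t=300)
```

Plotting such a map reproduces the four wedge-shaped regions separated by the two ±1-slope lines crossing the landmark.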

2.2 Deep Image-to-Image Network Learning for Action Map Estimation

To estimate the action map for a given image, we employ a fully convolutional neural network, given its efficient sampling scheme and the large receptive field available for comprehensive feature learning. Since both the input (raw image) and the output (action map) are images of the same size, we also call it a deep image-to-image network (DI2IN). Specifically, we follow the symmetric network architecture of SegNet [7]. The network is constructed with an encoder using the same structure as the fully convolutional part of the VGG-16 network [4], and a decoder that replaces the pooling layers with upsampling layers and otherwise mirrors the encoder structure. Batch normalization is used with each convolutional layer, and the max-pooling indices are stored during pooling and restored during upsampling. A softmax layer provides categorical outputs, and the cross-entropy loss is weighted by pre-computed class frequencies.
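The original network was implemented in Caffe; the following is a hedged PyTorch sketch, under our own assumptions, of one SegNet-style encoder/decoder pair, showing how the max-pooling indices are stored and reused during upsampling. It is an illustration of the pattern, not the full VGG-16-based architecture.

```python
import torch
import torch.nn as nn

# One SegNet-style encoder/decoder pair: convolution with batch
# normalization, max-pooling that records indices, and unpooling that
# restores them at the original spatial positions.
class EncDecPair(nn.Module):
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(mid_ch, in_ch, 3, padding=1),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.enc(x)
        x, idx = self.pool(x)    # keep max-pooling indices
        # ... deeper encoder/decoder pairs would nest here ...
        x = self.unpool(x, idx)  # restore positions from stored indices
        return self.dec(x)

# One output channel per action type; per-pixel softmax/cross-entropy
# weighted by class frequencies serves as the loss.
di2in = nn.Sequential(EncDecPair(1, 64), nn.Conv2d(1, 4, 1))
logits = di2in(torch.randn(2, 1, 480, 480))  # -> (2, 4, 480, 480)
```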

2.3 Action Map Aggregation for Landmark Detection

The landmark location must be derived from the estimated action map. However, the action map estimated by the DI2IN may not match the ideal construction exactly: there can be uncertainty around the partition lines between action types, as well as islands of falsely predicted action types inside a partition. This undermines the robustness of many straightforward approaches to landmark derivation. For example, starting from a random point and following the estimated action steps, as in DRL, does not guarantee convergence at the target landmark. Similarly, linear regression of the two partition lines can be disrupted even though their slopes are known, and dynamic programming based on the action flows can encounter deadlocks. We therefore propose an aggregative approach. Given the output action map \( A \) from the DI2IN, the estimated landmark coordinates \( (x', y') \) are determined by maximizing an objective function \( C(\cdot) \), which sums the per-action objective functions \( C_{a}(\cdot) \):

$$ x', y' = \mathop{\mathrm{argmax}}\nolimits_{x, y} C(x, y) = \mathop{\mathrm{argmax}}\nolimits_{x, y} \sum\nolimits_{a} C_{a}(x, y) $$
(3)

where the per-action objective function at pixel \( (x, y) \) aggregates the pixels predicted with that action on the same row or column, specifically

$$ C_{a}(x, y) = \begin{cases} d_{x}^{(a)} \left( \sum\nolimits_{i = -\infty}^{x} \delta\left(A(i, y) = a\right) - \sum\nolimits_{i = x}^{\infty} \delta\left(A(i, y) = a\right) \right) & \text{if } \left|d_{x}^{(a)}\right| = 1 \\ d_{y}^{(a)} \left( \sum\nolimits_{j = -\infty}^{y} \delta\left(A(x, j) = a\right) - \sum\nolimits_{j = y}^{\infty} \delta\left(A(x, j) = a\right) \right) & \text{if } \left|d_{y}^{(a)}\right| = 1 \end{cases} $$
(4)

Note that the objective function increases with pixels pointing toward \( (x, y) \) and decreases with pixels pointing away from \( (x, y) \) (Fig. 3); a vectorized sketch of this computation is given after the figure. Such aggregation enables robust derivation of the location coordinates even from a suboptimal action map output by the DI2IN.

Fig. 3. Illustration of action map aggregation at a single pixel. White arrows indicate an increase in the objective function; black arrows indicate a decrease.
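Eq. (4) can be evaluated for all candidate pixels at once with cumulative sums. The sketch below is a minimal NumPy version under our assumptions: `aggregate_landmark` is a hypothetical helper name, and the infinite sums are truncated at the image borders.

```python
import numpy as np

# Sketch of the aggregation in Eqs. (3)-(4): each pixel predicted with
# action a adds +1 to the objective of candidate locations it points
# toward and -1 to those it points away from, along its row (horizontal
# actions) or column (vertical actions).
def aggregate_landmark(action_map):
    moves = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # up/right/down/left
    score = np.zeros(action_map.shape, dtype=np.int64)
    for a, (dx, dy) in moves.items():
        mask = (action_map == a).astype(np.int64)
        axis = 1 if dx != 0 else 0   # rows for left/right, columns for up/down
        before = np.cumsum(mask, axis=axis)                         # sum over i <= x (or j <= y)
        after = mask.sum(axis=axis, keepdims=True) - before + mask  # sum over i >= x (or j >= y)
        score += (dx + dy) * (before - after)  # dx + dy is the one nonzero unit movement
    y_hat, x_hat = np.unravel_index(np.argmax(score), score.shape)
    return x_hat, y_hat

# On a perfect action map (e.g., from the earlier make_action_map sketch),
# the maximum of the objective lands on the landmark itself.
```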

3 Methods and Results

3.1 Data

Two ultrasound datasets are used in this study: a cardiac dataset with 1353 patients and an obstetric dataset with 1642 patients. Both datasets were collected and anonymized during routine clinical practice, and the landmarks of interest were annotated on each scan by clinical experts. The cardiac dataset comprises 8892 frames in total, collected across entire heart cycles rather than only around end-systole and end-diastole as in [6]; it therefore presents larger contextual variations and greater challenges for landmark detection. On a cardiac scan, the landmarks are defined as the two annulus points, i.e., the roots of the mitral valve, in the apical two-chamber (A2C) and apical four-chamber (A4C) views. In the obstetric dataset, each patient has only one scanned image. On an obstetric scan, the first landmark is annotated at the anterior end of the fetal head and the second at the posterior end. Note that the orientation of the fetal head can vary across the full 360° range in these scans; detecting these two landmarks is therefore not an easy task even for humans, and careful identification of the internal brain structures is necessary for consistent manual annotation. For each dataset, 80% of the patients are randomly selected as the training set, and the remaining 20% are used for testing. All images are normalized to 480 × 480 pixels before further processing.

3.2 Experimental Setup

We apply the proposed approach to the cardiac and obstetric ultrasound datasets individually. For each landmark, we train a DI2IN to learn its associated action map. The DI2INs are trained using the Caffe framework on a Linux workstation equipped with an Intel 3.50 GHz CPU and a 12 GB NVIDIA Titan X GPU. The encoder part of each DI2IN is initialized with the weights of VGG-16 trained on ImageNet. During training, the mini-batch size is set to 2, and standard stochastic gradient descent is used with a learning rate of 1e-3 and a momentum of 0.9 for 80,000 iterations. We compare the proposed SAC with other learning-based approaches on the same datasets, including PBT, DRL, and a state-of-the-art regression-based approach using a DI2IN [8] (referred to as I2I). Note that I2I and SAC use similar network structures but represent the landmark differently. For each method, we tune the configuration to obtain reasonably good results. Distance error of the landmark position in pixels is used for comparison, since all images are in the normalized space.
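For reference, the hyperparameters above map to the following hedged sketch, transposed from Caffe to PyTorch under our own assumptions; the 1×1 convolution and random tensors are stand-ins for the DI2IN and the image/action-map pairs.

```python
import torch
import torch.nn as nn

net = nn.Conv2d(1, 4, 1)         # stand-in for the DI2IN (4 action classes)
class_weights = torch.ones(4)    # stand-in for pre-computed class frequencies
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

for step in range(80000):                  # 80,000 SGD iterations
    images = torch.randn(2, 1, 480, 480)   # mini-batch size 2 (stand-in data)
    targets = torch.randint(0, 4, (2, 480, 480))
    optimizer.zero_grad()
    loss = criterion(net(images), targets)  # per-pixel weighted cross-entropy
    loss.backward()
    optimizer.step()
```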

3.3 Qualitative and Quantitative Results

The action maps estimated by SAC (Fig. 4) are clean (very few islands of false predictions) and smooth (sharp separations between regions of different action types). Keeping the pooling indices in the DI2IN turns out to be beneficial, as it enforces the smoothness of the estimated action map. Overall, the action maps look very reasonable, even though they do not match the ground truth exactly (the partition lines are not perfectly straight). The landmark locations derived from the estimated action maps are also close to the manual annotations.

Fig. 4. Example landmark detection results. The first two rows present the action maps estimated by the DI2IN for the first and second landmarks, respectively. The last row shows the two predicted landmark locations (magenta and yellow dots) together with the manual annotations (smaller white dots) in zoomed-in local patches from the first two rows, where the left patch corresponds to the first row and the right patch to the second row.

For cardiac scans, it is not too hard to identify a rough location of the target landmarks between the left ventricle and left atrium; however, precise localization is challenging because we include cardiac phases throughout entire heart cycles, across which the relative locations of the mitral valve annulus points and the surrounding structures vary considerably. Overall, compared to PBT and DRL, our proposed method provides consistently better accuracy and robustness (Table 1). Compared to I2I, SAC presents slightly better overall performance.

Table 1. Distance errors of landmark detection in pixels.

For the obstetric scans, it is very hard to identify the landmark locations correctly without capturing context in a large receptive field, given the many ambiguities around the almost radially symmetric head structure. Major failures are likely when only local context is used for feature modeling (PBT and DRL), whereas confusion about head orientation can be substantially prevented using a DI2IN (I2I and SAC). SAC demonstrates the best performance among all tested methods.

4 Discussion

In this paper, we introduce a new perspective on landmark detection: a novel approach inspired by DRL that converts the landmark detection problem into a supervised image partitioning task in the form of action maps. This landmark-to-image conversion enables the classifier not only to sample data in a more balanced manner (compared to PBT), but also to capture more comprehensive image context across the entire image to guide landmark detection (compared to DRL). Based on this conversion, we formulate a complete workflow by leveraging a DI2IN for action map estimation and designing an action map aggregation for landmark estimation. Using this workflow, we present competitive performance against other state-of-the-art approaches on cardiac and obstetric ultrasound datasets. As a next step, we will further investigate its clinical value: more training data will be used for better performance, more engineering effort will be spent on faster and smoother detection across frames, and more evaluations will focus on landmark-derived measurements relative to human error.

Our SAC approach is generic, and it shows great synergy with the DI2IN in our experiments on ultrasound datasets. We see substantial opportunities to improve performance by integrating new techniques for training DI2INs, e.g., deep supervision [9] and skip connections [10]. Meanwhile, given the promising results in 2-D ultrasound for single-landmark detection, it is worthwhile to extend SAC to (1) 3-D, (2) other imaging modalities, and (3) multi-landmark detection, where the action map generation and aggregation need to be adapted.