Abstract
Point of care ultrasound (POCUS) is the use of ultrasound imaging in critical or emergency situations to support clinical decisions by healthcare professionals and first responders. In this setting it is essential to enable potentially inexperienced users, who have not received extensive medical training, to obtain diagnostic data. Acquisition and interpretation of ultrasound images are not trivial: the user first needs to find a suitable sonic window from which a clear image can be obtained, and then needs to interpret it correctly to reach a diagnosis. Although many recent approaches focus on developing smart ultrasound devices that add interpretation capabilities to existing systems, our goal in this paper is to present a reinforcement learning (RL) strategy that is capable of guiding novice users to the correct sonic window and enabling them to obtain clinically relevant images of the anatomy of interest. We apply our approach to cardiac images acquired from the parasternal long axis (PLAx) view of the left ventricle of the heart.
1 Introduction
Ultrasound (US) is a flexible, portable, safe and cost-effective modality that finds several applications across multiple fields of medicine.
The characteristics of ultrasound make it extremely suitable for applications related to emergency medicine and point of care (POC) decision making. Recently, several ultra-portable and lightweight ultrasound devices have been announced and commercialized to enable these applications. These products are envisioned to be extremely inexpensive, to have a long battery life and a robust design, and to be operated by inexperienced users who may not have received any formal training. In order to reach the latter goal, images need to be interpreted by a computer vision based system and accurate instructions for fine manipulation of the ultrasound probe need to be provided to the user in real time.
In this paper we show how to use deep learning, and in particular deep reinforcement learning, to create a system that guides inexperienced users towards the acquisition of clinically relevant images of the heart in ultrasound. We focus on acquisition through the parasternal long axis (PLAx) sonic window on the heart, which is one of the most commonly used views in emergency settings due to its accessibility.
In our acquisition assistance framework the user is asked to place the probe anywhere on the left side of the patient’s chest and receives instructions on how to manipulate the probe in order to obtain a clinically acceptable parasternal long axis scan of the heart. Every time an image is produced by the ultrasound equipment, our deep reinforcement learning model predicts a motion instruction that is promptly displayed to the user. In this sense, we are learning a control policy that predicts actions (also called instructions) in correspondence with observations, which makes reinforcement learning a particularly suitable solution. This problem has several degrees of freedom: apart from instructions regarding left-right and top-bottom motions, the user also receives fine-grained manipulation indications regarding rotation and tilt of the probe.
Reinforcement learning has recently been employed to solve several computer vision related problems and, specifically, to achieve superhuman performance in playing Atari games and 3D video-games such as “Doom” [3].
In [7, 8] a deep convolutional neural network has been employed together with Q-learning to predict the expected cumulative reward Q(s, a) associated with each action that the agent can perform in the game. In [5] a learning strategy that employs two identical networks, updated at different paces, is presented. In that approach, the target network is used for predictions and is updated smoothly at regular intervals, while the main network gets updated batch-wise through back-propagation. This is particularly useful in continuous control. In [13] the network architecture used to predict the Q-values is modified to comprise two different paths which predict, respectively, the value V(s) of being in a certain state and the advantage of taking each action in that state. This strategy has resulted in improved performance. In [12] the target Q-values, which are regressed during training, are computed differently than in [7]. Instead of having the network regress Q-values computed as the reward \(r_t\) plus \(\gamma \,\max \nolimits _a{Q^{*}(s_{t+1},a)}\), they use \(r_t + \gamma Q^{*}(s_{t+1},a_{t+1})\). The main difference is that, in the latter, the action \(a_{t+1}\) is the one selected by the network for the state \(s_{t+1}\), rather than the action yielding the maximum Q-value. This yields increased stability of the Q-values.
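To make the difference between the two bootstrap targets concrete, they can be sketched as follows (an illustrative sketch in numpy; the function names and the omission of terminal-state handling are ours):

```python
import numpy as np

def dqn_target(r_t, q_next_target, gamma=0.99):
    # Standard target used in [7]: bootstrap with the maximum Q-value
    # that the target network predicts for the next state.
    return r_t + gamma * np.max(q_next_target)

def double_dqn_target(r_t, q_next_main, q_next_target, gamma=0.99):
    # Target of [12]: the main network selects the next action,
    # and the target network evaluates it.
    a_next = int(np.argmax(q_next_main))
    return r_t + gamma * q_next_target[a_next]
```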
Reinforcement learning has been applied in the medical domain for the first time in [10] to segment ultrasound images. In [9] a similar approach has been applied to heart model personalization on synthetic data. In [1] a DQN has been employed to solve an optimal view plane selection problem in MRI, through an agent trained to obtain a specific view of brain scans.
In this work we apply deep reinforcement learning (via a DQN) to a guidance problem whose goal is to provide instructions to users in order to enable them to scan the left ventricle of the heart using ultrasound through the parasternal long axis sonic window. We build our learning strategy to perform end-to-end optimization of the guidance performance and we train our agent using a simulated US acquisition environment. We compare the performance of our method with that obtained by training a classifier to learn a policy on the same data in a fully supervised manner.
2 Method
An RL problem is usually formulated as a Markov decision process (MDP) (Fig. 1 left). At each point in time, the agent observes a state \(S_t\) and interacts with the environment, using its policy \(\pi \in \varPi \), through actions \(a \in A\), obtaining a finite reward \(r_t\) together with a new state \(S_{t+1}\). \(\varPi \) is the set of all possible policies, while A is the set of all supported actions.
The set of supported actions in our system contains 9 actions, as shown in Table 1.
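Table 1 is not reproduced here; the sketch below lists a plausible encoding of the nine actions, inferred from the probe motions described in the remainder of this section (the names are ours and may differ from those used in the table):

```python
from enum import Enum

class ProbeAction(Enum):
    NOP = 0                       # hold the probe still at the current view
    MOVE_LEFT = 1                 # coarse translations on the chest grid
    MOVE_RIGHT = 2
    MOVE_UP = 3
    MOVE_DOWN = 4
    ROTATE_CLOCKWISE = 5          # fine rotation corrections
    ROTATE_COUNTERCLOCKWISE = 6
    TILT_INFERO_MEDIAL = 7        # fine tilt corrections
    TILT_SUPERO_LATERAL = 8
```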
In this section we present the details of our implementation.
2.1 Simulated Acquisition Environment
In order to learn from experience, our reinforcement learning agent needs to collect data according to its policy by physically moving the probe on the chest of the patient in order to obtain observations and rewards. Real-time interaction is unfortunately impractical, since acquiring the trajectories would take an enormous amount of time and a patient would need to be scanned for the whole duration of learning.
We have therefore resorted to acquiring, independently of our learning procedure, a large number of spatially tracked video frames from patients. By exploiting the spatial relationships between the frames, we are able to navigate the chest area offline and obtain simulated trajectories. We have defined, for each participant in the study, a work area covering a large portion of their chest. We have divided this area into \(7\times 7\) mm spatial bins. The bins from which it is possible to obtain a valid PLAx view by fine manipulation of the probe are annotated as “correct”, while all other bins remain unmarked. This annotation is necessary to implement the reward system.
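As an illustration, assigning a tracked frame to a spatial bin amounts to quantizing its in-plane position over the work area (a minimal sketch, assuming positions expressed in millimetres relative to the work-area origin; names are ours):

```python
import numpy as np

BIN_SIZE_MM = 7.0  # 7 x 7 mm spatial bins

def position_to_bin(xy_mm, origin_mm=(0.0, 0.0)):
    """Map an in-plane probe position (mm) to the indices of its grid bin."""
    offset = np.asarray(xy_mm, dtype=float) - np.asarray(origin_mm, dtype=float)
    return tuple(np.floor(offset / BIN_SIZE_MM).astype(int))

# Example: a frame acquired 23 mm to the right of and 10 mm below the
# work-area origin falls into bin (3, 1).
print(position_to_bin((23.0, 10.0)))
```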
Our system offers guidance for 4 out of the 5 degrees of freedom of probe motion (Fig. 2 right). We get data for the first two degrees of freedom, left-right and top-bottom translations, by moving the probe in a regular and dense grid pattern over the chest in order to “fill” each bin of the grid with at least 25 frames. In correspondence with the bins marked “correct”, the sonographer is also asked to acquire 50 “correct” frames showing the best view, and 50 frames from each of the following scenarios: the probe is rotated by an excessive amount in the (i) clockwise or (ii) counterclockwise direction, or the probe is tilted by an excessive amount in the (iii) infero-medial or (iv) supero-lateral direction. In this way data for the last two degrees of freedom is obtained.
In order to build the environment we need to track both the body of the patient and the probe as data gets acquired. A schematic representation of our tracking system is shown in Fig. 2 (left). The tracking system, an NDI Polaris Vicra optical tracker, produces in real time a tracking stream consisting of two \(4\times 4\) transformation matrices, \(T_{track>probe}\) and \(T_{track>body}\). The transform \(T_{probe>image}\), which is necessary to obtain the true pose of each image acquired through our system, is obtained by performing calibration with the open-source software fCal, which is provided as part of the PLUS framework [4]. The video frames are acquired through an ultrasound probe and supplied to the data acquisition system through the OpenIGTLink interface [11]. The tracking and video streams are handled and synchronized using the PLUS framework in order to obtain tracked frames.
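The pose of each image relative to the patient’s body can then be recovered by chaining the three transforms (a numpy sketch; the variable names mirror the notation above):

```python
import numpy as np

def image_pose_in_body(T_track_probe, T_track_body, T_probe_image):
    """Compose the tracked 4x4 transforms to express the image pose in the
    patient (body) coordinate frame, i.e. T_body>image."""
    T_body_track = np.linalg.inv(T_track_body)      # body <- tracker
    T_track_image = T_track_probe @ T_probe_image   # tracker <- image
    return T_body_track @ T_track_image             # body <- image
```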
During training and testing the agent interacts with the simulated environment by performing actions which result in state changes and rewards. The actions have the effect of either stopping the virtual probe (“NOP” action), or bringing it closer to or further away from the nearest goal point.
At the beginning of each episode the environment is reset and a virtual “probe” is randomly placed in one of the bins. Actions bringing the agent further from the correct bin result in a negative reward of \(-0.1\), motions towards the correct view result in a reward of 0.05, “NOPs” issued at the correct bin for the correct view result in a reward of 1.0, and “NOPs” issued in correspondence with an incorrect view result in a penalty of 0.25.
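The reward scheme above can be written compactly as follows (a minimal sketch; the distance is measured on the grid to the nearest “correct” bin and the argument names are ours):

```python
def compute_reward(prev_dist, new_dist, is_nop, at_correct_view):
    """Reward described in the text: +1.0 for a NOP at the correct bin and
    view, -0.25 for a NOP anywhere else, +0.05 for moving towards the goal
    and -0.1 for moving away from it."""
    if is_nop:
        return 1.0 if at_correct_view else -0.25
    return 0.05 if new_dist < prev_dist else -0.1
```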
2.2 Deep Q-Network
In this work we implement the Q-learning paradigm already employed by [7, 8]. This off-policy learning strategy leverages a convolutional neural network to regress Q-values, which are the expected cumulative rewards associated with each action in a given state. As previously stated, the inputs of the model are ultrasound images, and its output is a vector of nine Q-values, one for each action. Similarly to [5] we instantiate two copies of the same network: a target network which produces the values \(Q_{\theta ^{*}}(s,a)\) and a main network which predicts \(Q_{\theta }(s,a)\).
In order to train our agent we interact with the training environments. Each environment corresponds to one patient. During an episode, we select an environment among those available for training and we reset the virtual probe to a random position. We then use the main network to collect experience by interacting with the environment. We implement exploration using an epsilon-greedy strategy which randomly hijacks and replaces the actions chosen through \(\mathop {\mathrm {arg}\,\text {max}}\nolimits _a(Q_{\theta }(s,a))\) with random ones. In this way we are able to balance the need to explore the environment with the need to follow the learned policy. All the agent’s experiences are collected in an experience replay buffer of adequate size, as previously done in [7]. Since all our data is pre-acquired, it is possible to increase the memory efficiency of the experience replay buffer by storing image paths on the file system instead of uncompressed images.
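A minimal sketch of the epsilon-greedy action selection and of a replay buffer that keeps only image paths in memory is given below (class and function names are ours):

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, replace the greedy action with a random one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(max(range(len(q_values)), key=lambda a: q_values[a]))

class ReplayBuffer:
    """Stores transitions as (image_path, action, reward, next_image_path, done);
    frames are re-loaded from disk at batch-assembly time."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```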
Once there is enough data in the experience replay buffer, we sample random training batches from it and we use them to update the parameters \(\theta \) of the main network using back-propagation. The objective function that we minimize with respect to the parameters of the network, using ADAM as our optimizer, is the squared error between the Q-values \(Q_{\theta }(s_t,a_t)\) predicted by the main network and the target values formed from the observed reward and the discounted Q-value estimate of the target network for the next state.
The parameters \(\theta ^{*}\) of the target network are updated with the parameters of the main network once every 250 episodes.
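A sketch of one batch update is shown below; the bootstrap target uses the standard maximum over the target network’s Q-values, which is an assumption on our part (the exact target formulation, e.g. the variant of [12], is not detailed here), and the tensor names are ours:

```python
import torch
import torch.nn.functional as F

def dqn_update(main_net, target_net, optimizer, batch, gamma=0.99):
    """One back-propagation step on the main network from a replay batch."""
    states, actions, rewards, next_states, done = batch

    # Q-values predicted by the main network for the actions actually taken.
    q_pred = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the target network (assumed max-Q target).
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * q_next * (1.0 - done)

    loss = F.mse_loss(q_pred, targets)  # squared TD error, minimized with ADAM
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```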
A schematic representation of the network architecture is shown in Fig. 1 (right). This network makes use of global average pooling [6] applied to the output of the last convolutional layer. All the non-linearities employed throughout the network are exponential linear units (ELUs) [2]. The network outputs a 9-dimensional vector representing the Q-values.
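Since the exact layer configuration of Fig. 1 (right) is not reproduced here, the following is only an illustrative sketch that keeps the described ingredients (ELU non-linearities, global average pooling after the convolutional features, and a 9-dimensional output); filter counts and kernel sizes are assumptions:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ELU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ELU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling [6]
        self.head = nn.Linear(128, n_actions)   # one Q-value per action

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)
```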
During testing, the target network interacts with the environment. All actions are chosen deterministically through \(\mathop {\mathrm {arg}\,\text {max}}\nolimits _a(Q_{\theta ^{*}}(s,a))\), which therefore corresponds to a stationary deterministic policy.
2.3 Supervised Policy Learning
In order to obtain a means of comparison for our approach, we have implemented a supervised policy learning approach which relies on classification and labeled data to learn the correct action to perform in each state. When we acquire data from patients we build environments where the parameters of the correct view in terms of translation, rotation and tilt are known. This enables us to label each image in each bin of the grid with one action, namely the optimal action to perform in that state when relying only on the Manhattan distance \(\Vert \mathbf {x} - \mathbf {x}_{goal}\Vert _1\) between the bin position \(\mathbf {x}\) on the grid and the goal bin position \(\mathbf {x}_{goal}\). In particular, for each bin of the grid, we choose the label for its images as the action that reduces the distance to the goal on the axis where the distance is currently the smallest.
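The labelling rule for the translation actions can be sketched as follows (grid-axis orientation and action names are ours; the fine-manipulation labels assigned at the goal bin are not covered by this sketch):

```python
import numpy as np

def supervised_label(bin_xy, goal_xy):
    """Pick the action that reduces the Manhattan distance to the goal bin
    along the axis with the smallest remaining (non-zero) distance."""
    dx, dy = np.asarray(goal_xy) - np.asarray(bin_xy)
    if dx == 0 and dy == 0:
        return "NOP"  # already at the goal bin
    if dy == 0 or (dx != 0 and abs(dx) <= abs(dy)):
        return "MOVE_RIGHT" if dx > 0 else "MOVE_LEFT"
    return "MOVE_DOWN" if dy > 0 else "MOVE_UP"
```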
We train a classifier with the same architecture shown in Fig. 1 (right), with the only exception that the last layer is followed by a soft-max activation function. We use all the data that is available to our reinforcement learning agent, shuffled and organized in batches of the same size as those used for our DQN.
During testing we use the same environments used by the reinforcement learning agent to test the supervised policy end-to-end on the guidance task. In this way we can compare the performance of the two strategies on fair grounds.
3 Results
We evaluate our method on the end-to-end guidance task described in the previous sections, using one environment for each patient. We train our approach on 22 different environments corresponding to circa 160,000 ultrasound images, and we test it on 5 different environments which contain circa 40,000 scans. During testing we start from every grid bin of each environment and we evaluate the guidance performance of the approach.
As previously explained, we train both a RL-based approach and a supervised classification-based approach. Results are shown in Table 2.
We perform data augmentation for both the supervised and RL approaches. Each training sample is slightly rotated, shifted and re-scaled by a random amount before being presented as input to the network. The gamma of the images is also subject to augmentation. The episodes have a standard duration of 50 steps and “NOP” operations do not terminate the episode. Instead, a new, randomly selected image from the same grid bin is returned to the agent. This is similar to what happens in practice when a user keeps the probe in the same location.
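A sketch of such an augmentation step is given below; the perturbation ranges are illustrative assumptions, intensities are assumed normalized to [0, 1], and in practice the re-scaled image would be cropped or padded back to the network input size:

```python
import numpy as np
from scipy import ndimage

def augment(image, rng):
    """Random rotation, shift, re-scaling and gamma perturbation of a frame."""
    image = ndimage.rotate(image, angle=rng.uniform(-10, 10),
                           reshape=False, mode="nearest")
    image = ndimage.shift(image, shift=rng.uniform(-5, 5, size=2),
                          mode="nearest")
    image = ndimage.zoom(image, zoom=rng.uniform(0.9, 1.1), mode="nearest")
    gamma = rng.uniform(0.8, 1.2)
    return np.clip(image, 0.0, 1.0) ** gamma

# Usage: augmented = augment(frame, np.random.default_rng(0))
```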
Our results are summarized in Table 2. The table is split into two parts: the first part summarizes the performance of the methods on the end-to-end guidance task and reports the percentage of correct and incorrect guidance, that is, the percentage of episodes that have ended in a “correct” bin. Additionally we report the percentage of “NOPs” that have been issued at an incorrect location. Please note that “NOP” can be issued multiple times during a single episode, and the agent may have briefly issued an “incorrect NOP” even during successful episodes. The evaluation reveals that the supervised approach is less successful than the RL approach on the guidance task. The second part of the table reports information about the behaviour of the reward. These results also demonstrate that our RL agent performs more “correct” actions than its supervised counterpart.
4 Conclusion
Our approach employs reinforcement learning (RL) to guide inexperienced users during cardiac ultrasound image acquisition. The method achieves better results than a similar (non-RL) supervised approach trained on the same data and tested on the end-to-end guidance task. The intuition behind this is that RL is able to avoid and go around areas that are highly ambiguous, as the agent learns to predict rather low Q-values for actions leading to ambiguous states.
Although the results are promising, there are still issues related to the data acquisition strategy of our approach and to the long training time. In conclusion, we believe that this method is one of the first steps towards a solution that solves the guidance task end-to-end in a reliable and effective manner.
References
Alansary, A., et al.: Automatic view planning with multi-scale deep reinforcement learning agents. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 277–285. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_32
Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
Lample, G., Chaplot, D.S.: Playing FPS games with deep reinforcement learning. In: AAAI, pp. 2140–2146 (2017)
Lasso, A., Heffter, T., Rankin, A., Pinter, C., Ungi, T., Fichtinger, G.: PLUS: open-source toolkit for ultrasound-guided intervention systems. IEEE Trans. Biomed. Eng. 61(10), 2527–2537 (2014)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Neumann, D., et al.: A self-taught artificial agent for multi-physics computational model personalization. Med. Image Anal. 34, 52–64 (2016)
Sahba, F., Tizhoosh, H.R., Salama, M.M.: A reinforcement agent for object segmentation in ultrasound images. Expert Syst. Appl. 35(3), 772–780 (2008)
Tokuda, J., et al.: OpenIGTLink: an open network protocol for image-guided therapy environment. Int. J. Med. Robot. Comput. Assist. Surg. 5(4), 423–434 (2009)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI, pp. 2094–2100 (2016)
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)