1 Introduction

Heart disease is the leading cause of death globally, claiming the lives of over 8.5 million people in 2015 alone [17]. Left ventricular ejection fraction (LVEF) is an important cardiac parameter and the key predictor of prognosis in most cardiac conditions, including valve disease, coronary artery disease, and heart failure [3]. Formally, LVEF is defined as the ratio of the amount of blood pumped out of the left ventricle (LV) during each systole to the maximum amount of blood in the LV at the end of diastole. The most common imaging modality for measuring LVEF is echocardiography (echo) [3]. Echo is non-ionizing, accessible, low-cost and real-time, and is therefore ideal for studying cardiac anatomy and function. In 2D echo, LVEF is conventionally quantified using the biplane method of disks, a.k.a. Simpson’s rule [3]. This method calculates LVEF through LV volume estimation in end-systolic (ES) and end-diastolic (ED) frames, from apical two-chamber (A2C) and apical four-chamber (A4C) views. This segmentation-based routine is time-consuming and challenging in the presence of noise and unclear endocardial boundaries. Furthermore, studies suggest that manual measurement of LVEF suffers from intra- and inter-user variability, especially among novice cardiologists [2, 5]. To assist with automation of LV segmentation, several solutions have become commercially available [19]. A number of research groups have also proposed semi-automatic and automatic LV segmentation techniques, including recent machine learning and deep learning approaches [6, 8, 14, 15, 21]. Though promising for LV volume estimation in a given frame, these methods can lack robustness for LVEF prediction, due to the dependence of LVEF on accurate LV tracing in both ED and ES.
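In terms of the end-diastolic and end-systolic LV volumes \(V_{\text {ED}}\) and \(V_{\text {ES}}\), this corresponds to \(\text {LVEF} = (V_{\text {ED}} - V_{\text {ES}})/V_{\text {ED}} \times 100\%\), so errors in the LV tracing at either phase propagate directly into the estimate.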

Fig. 1. Comparison of LV motion in ES and ED phases of SAX (a), A2C (b) and A4C (c). Deformations and movements of the chambers and valves are more complex in A2C and A4C (used in echo) than in SAX (used in MR), making LVEF assessment more difficult in echo.

Clinically, LVEF is often measured through direct visual estimation [13]. Experienced cardiologists can eyeball LVEF from echo cine loops based on the wall motion and atrio-ventricular plane displacement [13]. Studies suggest that direct visual estimation of LVEF correlates closely with quantitative segmentation-based techniques [10]. Although it is the preferred choice of experts for quick LVEF assessment, visual estimation is highly reader-dependent, and novice imagers are therefore hesitant to use it [3, 13]. Moreover, eyeballing LVEF is not a reliable option for other clinicians with limited echo training.

Direct estimation of LV volume and LVEF in cardiac magnetic resonance (MR) images has been explored by several groups [9, 12, 20, 22]. Nevertheless, to the best of our knowledge, direct LVEF assessment has not been previously investigated in echo images. It is worth noting that LVEF estimation in echo is inherently a much more difficult problem than in MR, for several reasons. First, variability in acquiring the standard echo imaging planes introduces greater variance in the appearance of the LV anatomy in 2D echo images. Moreover, the short-axis (SAX) view used for LVEF estimation in MR (Fig. 1(a)) captures a much simpler cardiac motion and field-of-view than the views used in echo (Fig. 1(b) and (c)). Other challenges in echo include variable image quality and image settings, which further add to the complexity of a machine learning-based solution for direct LVEF assessment.

In this paper, we introduce a deep network that mimics the clinicians’ eyeballing technique in echo to classify exams as high-risk (\(\text {LVEF}\le 40\%\)) or low-risk (\(40\%<\text {LVEF}\le 75\%\)). The following contributions are made: (1) Our approach directly estimates LVEF from echo cine loops, eliminating the need for LV segmentation and detection of key cardiac frames. LV segmentation can be challenging due to the high variability in echo image quality and image settings, as well as variability in the operator’s experience in obtaining the correct echo standard views; (2) We propose a dual-stream framework for A2C and A4C views, consisting of view-specific spatial feature extraction blocks as well as shared recurrent neural network (RNN) layers; (3) We report the performance of several state-of-the-art networks and empirically show that, for all of them, the dual-view framework performs equally well or better than a single apical view in classifying low-risk vs. high-risk LVEF.

2 Material

LVEF Labels: Our objective is to distinguish between the low-risk and high-risk LVEF classes. Let \(\mathbf {Y}_{\text {Simpson's}}\) and \(\mathbf {Y}_{\text {Binary}}\) denote the Simpson’s rule-based gold standard LVEF measurement and derived risk-based binary labels, respectively. We define \(\mathbf {Y}_{\text {Binary}}\) such that \(\mathbf {Y}_{\text {Binary}}=1\) for \({\mathbf {Y}}_{\text {Simpson's}}\le 40\%\), and \(\mathbf {Y}_{\text {Binary}}=0\) for \({40\%<\mathbf {Y}}_{\text {Simpson's}}\le 75\%\). Figure 2 visualizes the clinical labels in the database (\({\mathbf {Y}}_{\text {Simpson's}}\) and \({\mathbf {Y}}_{\text {Eyeballed}}\)) and the derived risk-based binary labels used in the present classification network (\(\mathbf {Y}_{\text {Binary}}\)). Cases with \({\mathbf {Y}}_{\text {Simpson's}}>75\%\) are excluded from this study due to the very limited number of samples.
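For concreteness, the label derivation can be written as a small Python routine; the 40% and 75% thresholds are those defined above, while the function and variable names are illustrative.

```python
from typing import Optional

def derive_binary_label(lvef_simpsons: float) -> Optional[int]:
    """Map a Simpson's-rule LVEF measurement (in %) to the risk-based binary label.

    Returns 1 for high-risk (LVEF <= 40%), 0 for low-risk (40% < LVEF <= 75%),
    and None for cases excluded from this study (LVEF > 75%).
    """
    if lvef_simpsons <= 40.0:
        return 1   # high-risk
    if lvef_simpsons <= 75.0:
        return 0   # low-risk
    return None    # excluded

labels = [derive_binary_label(v) for v in (32.5, 55.0, 80.0)]  # -> [1, 0, None]
```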

Fig. 2. LVEF labels used in the main database (\(\mathbf {Y}_{\text {Simpson's}}\), \(\mathbf {Y}_{\text {Eyeballed}}\)) and for classification in this paper.

Fig. 3. Examples of synchronized A2C and A4C echo cines. Cines are temporally resampled from \(R_1^{\text {AXC}}\) to \(R_2^{\text {AXC}}\) and are thus effectively synchronized.

Database: Ethics approval was obtained from our local regulatory authority to access a database of clinical echo exams and corresponding diagnostic reports at a tertiary care center. We searched the report database for echo exams that satisfied the following criteria: (1) The segmentation-based (\(\mathbf {Y}_{\text {Simpson's}}\)) and segmentation-free (\(\mathbf {Y}_{\text {Eyeballed}}\)) LVEF labels are recorded in the report database and in agreement; (2) Correspondences can be found between the echo cines and the diagnostic report based on the study identification information; (3) A2C and A4C views are both available. In addition, in this paper we focus on studies acquired with the same family of ultrasound machines (Philips iE33). A total of 1,186 samples satisfying the above criteria were gathered: 541 high-risk and 645 low-risk cases. The dataset was divided into training and test sets at a 4:1 ratio.

Echo Data and Preparation: 2D frames of \(800\times 600\) pixels are cleaned using a binary beam-shaped mask, cropped around the beam area, and downsized to \(128 \times 128\) pixels. Temporally, frames are sampled from one full visible cycle in each cine loop \(\text {AXC}\), where \(\text {AXC}\in \{\text {A2C}, \text {A4C}\}\). To extract one cycle from each \(\text {AXC}\) cine, we find the R peaks in the accompanying electrocardiogram (ECG) and trim the cine to frames \(R_1^{\text {AXC}}\) to \(R_2^{\text {AXC}}\). An equal number of \(F=25\) frames is uniformly sampled from each sequence (Fig. 3).
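A minimal sketch of the temporal sampling step is given below, assuming the frames have already been masked, cropped and resized; the rounding of fractional frame indices is an implementation detail not specified in the paper.

```python
import numpy as np

def sample_one_cycle(cine, r1, r2, num_frames=25):
    """Uniformly sample a fixed number of frames from one cardiac cycle.

    cine: array of shape (T, H, W), already masked, cropped and resized to 128x128.
    r1, r2: frame indices of two consecutive R peaks from the ECG.
    """
    indices = np.linspace(r1, r2, num_frames).round().astype(int)
    return cine[indices]

# e.g. a 60-frame A4C cine with R peaks at frames 5 and 48
cine = np.random.rand(60, 128, 128).astype(np.float32)
one_cycle = sample_one_cycle(cine, r1=5, r2=48)   # shape (25, 128, 128)
```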

Fig. 4. Architecture of proposed multi-view classification network for LVEF estimation.

3 Methods

We propose the network in Fig. 4 for binary LVEF classification. This network consists of spatial feature extraction (FE) blocks as well as RNN-based layers for temporal learning.

Dual-view Spatial Feature Learning: We rely on CapsuleNet [18] and DenseNets [11] for frame feature extraction (FE), as both have recently proven successful in spatial feature learning. Initially, sampled synchronous A2C and A4C frames are fed into the FE blocks. The flattened output of an FE block for a frame t is a feature vector \(\mathbf {X}_{m,t}^{\text {AXC}}\) of length \(M\times 1\); \(m=1:M\). In the dual-view framework, \(\mathbf {X}_{m,t}^{\text {A2C}}\) and \(\mathbf {X}_{m,t}^{\text {A4C}}\) are then concatenated to form a dual-view feature vector \(\mathbf {X}_{m,t}^{\text {A2C+A4C}}\) of length \(2M\times 1\). For an exam with two streams and a sequence length of F frames, a feature matrix \(\mathbf {X}_{m,t}^{\text {A2C+A4C}}\) of size \(2M\times F\) is constructed, where \(t=1:F\). \(\mathbf {X}_{m,t}^{\text {A2C+A4C}}\) is a dense representation of the cardiac cycle based on the two views.
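A minimal Keras sketch of this dual-view spatial feature extraction is shown below, assuming \(F=25\) frames of \(128\times 128\) pixels per view; the small convolutional stack is only a stand-in for the DenseNet/CapsuleNet FE blocks, and the feature length M and layer sizes are illustrative rather than the configurations used in the paper.

```python
from tensorflow.keras import layers, Model

F, H, W = 25, 128, 128   # frames per cine and frame size (Sect. 2)
M = 256                  # per-view feature length (illustrative)

def make_fe_block(name):
    """Stand-in spatial FE block; the paper uses DenseNet or CapsuleNet here."""
    inp = layers.Input(shape=(H, W, 1))
    x = layers.Conv2D(32, 3, strides=2, activation='relu')(inp)
    x = layers.Conv2D(64, 3, strides=2, activation='relu')(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(M)(x)                     # per-frame feature vector of length M
    return Model(inp, x, name=name)

a2c_in = layers.Input(shape=(F, H, W, 1), name='A2C_cine')
a4c_in = layers.Input(shape=(F, H, W, 1), name='A4C_cine')

# View-specific FE blocks applied to every frame of each cine
a2c_feat = layers.TimeDistributed(make_fe_block('FE_A2C'))(a2c_in)   # (F, M)
a4c_feat = layers.TimeDistributed(make_fe_block('FE_A4C'))(a4c_in)   # (F, M)

# Per-frame concatenation of the two views -> dual-view feature matrix (F, 2M)
dual_feat = layers.Concatenate(axis=-1)([a2c_feat, a4c_feat])
```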

RNNs for Temporal Encoding: The other key components of the network are the RNN blocks, which enable sequential and temporal learning. We investigated various RNNs, including cascades of uni- and bi-directional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers. The RNN blocks take in \(\mathbf {X}_{m,t}^{\text {A2C+A4C}}\) at F separate time steps and output an array of learned sequential features. This output is then passed to a cascade of two fully connected (FC) layers, with ReLU and Softmax activation functions, respectively.
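Continuing the sketch above, a possible temporal head over the dual-view feature matrix could look as follows; the GRU width, the number of stacked recurrent layers and the FC sizes are assumptions, not the configurations reported in the paper.

```python
from tensorflow.keras import layers, Model

F, M = 25, 256  # frames per cine and per-view feature length (as in the FE sketch)

# Input: dual-view feature matrix, one 2M-dimensional vector per time step
dual_feat = layers.Input(shape=(F, 2 * M), name='dual_view_features')

# Shared recurrent layers; bidirectional GRUs gave the best results in the paper
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(dual_feat)
x = layers.Bidirectional(layers.GRU(128))(x)

# Cascade of two FC layers with ReLU and Softmax activations
x = layers.Dense(64, activation='relu')(x)
out = layers.Dense(2, activation='softmax', name='lvef_risk_class')(x)

temporal_head = Model(dual_feat, out)
```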

Training: The proposed architecture is implemented in Python using Keras with TensorFlow backend [4]. Dropout and batch normalization layers are used after FE blocks to prevent overfitting. The start points of the sampled frames are selected at random within the range \(R_1^{\text {AXC}}\) to \(R_2^{\text {AXC}}\). Augmented data is created on the fly via randomly generated transforms, including rotation, scaling, cropping and gamma transformation on the intensities.
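As an illustration of the on-the-fly augmentation, the sketch below applies a random rotation and gamma transform consistently across all frames of one cine; the transform ranges are assumptions, and random scaling and cropping are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_cine(frames, rng):
    """Randomly rotate and gamma-transform one cine of shape (F, H, W), values in [0, 1]."""
    angle = rng.uniform(-10.0, 10.0)   # one rotation angle per cine (degrees)
    gamma = rng.uniform(0.8, 1.2)      # one gamma value per cine
    rotated = rotate(frames, angle, axes=(1, 2), reshape=False, order=1)
    return np.clip(rotated, 0.0, 1.0) ** gamma

rng = np.random.default_rng(0)
cine = np.random.rand(25, 128, 128).astype(np.float32)
augmented = augment_cine(cine, rng)    # same shape, transformed on the fly
```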

4 Results

Quantitative results obtained in this study are summarized in Fig. 5. The highest performance is achieved using the dual-view approach with DenseNets and bidirectional GRUs. Figure 6 depicts this network’s performance on a few examples of A2C+A4C image pairs.

Fig. 5. LVEF classification accuracy using DenseNet (DNet) and CapsuleNet (CNet) as the spatial FE and various RNN variants, on A2C, A4C and A2C+A4C views.

Fig. 6. Performance of DenseNets and bidirectional GRU on a few A2C+A4C pairs. Cardiac echo quality and proper synchronization of views affect the model performance.

5 Discussion and Conclusion

In this paper, we introduced a new framework based on DenseNet, CapsuleNet and RNN layers for estimating LVEF from echo cines in the A2C and A4C standard echo views. Our results suggest that A2C alone is a less reliable view for LVEF estimation, while A4C alone appears to be a much more robust option with the current framework. However, the most accurate results are achieved by combining both apical views. This observation is also aligned with anecdotal clinical evidence: A2C views are more difficult to obtain than A4C and are more likely to be foreshortened [16], hence LVEF estimation from A2C can be less reliable. LSTM and GRU often performed equivalently, although the highest accuracy was obtained using GRU blocks. The results also consistently suggest that bidirectional recurrent layers are equivalent to or better than unidirectional ones. The optimal deep model, consisting of DenseNet + bidirectional GRU, achieved a success rate of 83.1% on the test set for detecting high-risk LVEF. We observed that DenseNet achieved a higher accuracy than CapsuleNet. Given the performance of CapsuleNet on public datasets [18], this was inconsistent with our initial expectations. However, we suspect that this is due to the small size of our training set for learning such a complex, yet subtle, problem. DenseNets have proven effective for learning spatial features in relatively small training sets [11]. It is worth mentioning that, based on our analysis of the main diagnostic report database, only approximately 70.1% of the \(\mathbf {Y}_{\text {Simpson's}}\) and \(\mathbf {Y}_{\text {Eyeballed}}\) labels agree. While the disagreeing cases were excluded from the presented study, we suspect that the accuracy of the clinical ground truth labels may be similarly compromised to some extent.

A key pattern recognized from the results is the link between model performance, the quality of apical images, and view synchronization (Fig. 6). Misclassified images generally have unclear LV boundaries, which cause a great deal of variance in the appearance of the heart and its motion. Also, despite the automatic and manual view classification, confusion between the four apical views (A2C, three-chamber, A4C and five-chamber) appears to remain a challenge and a potential source of error (e.g., Fig. 6(c)). Thus, a bottom-up approach for improving LVEF accuracy is to improve the quality of the input data. Abdi et al. recently proposed a deep-learning solution for automatic estimation of echo quality [1], which can be used to provide feedback to ultrasound operators for improving the quality of data acquisition.

A resolvable limitation of the proposed solution is its dependence on ECG for phase detection and synchronization, since ECG is not available in point-of-care settings. Moreover, visual inspection of the results revealed a correlation between misclassification and apparently improper synchronization (see e.g., Fig. 6(d), which shows asynchronous A2C and A4C views based on the valve state). We believe improving the phase detection can contribute to more accurate results. Alternatively, cine-based cardiac phase detection can be incorporated into the network. A possible solution has been proposed by Dezaki et al. [7] for A4C images, which can be similarly extended to A2C. This method is capable of automatically identifying ES and ED, which could be used to achieve potentially richer temporal sampling of the systolic and diastolic phases.

One possible option to eliminate the phase dependence altogether is to use two separate RNN streams, one for each of the A2C and A4C views. This decouples the two views from one another, enabling the use of potentially informative cines in full. However, this architecture substantially increases the network size and was still less successful for LVEF estimation in our experiments thus far. This is most likely because the inputs of the RNN blocks, i.e. the frame feature vectors, are denser and richer when constructed from two complementary views, allowing for more effective temporal learning. This may change should we increase our training set.

While a binary risk-based LVEF classification tool could assist with immediate decision making at the point of care, it suffers from a flaw: it imposes a sharp boundary on the true regression labels (\(\mathbf {Y}_{\text {Simpson's}}\)). This can be amended by adding a medium-risk class, or by using the finer-grained classes of \(\mathbf {Y}_{\text {Eyeballed}}\). We plan to include exams from other ultrasound machines to obtain enough data for this multi-class classification.

Given that LV localization appears to be the key step in some LVEF estimation approaches proposed for cardiac MR [12], another question worth exploring is whether LV localization helps with LVEF accuracy in echo. While the motion of the atria and right ventricle can contain subtle information about LVEF, excluding them decreases the variance introduced by the neighbouring chambers. Existing encoder-decoder segmentation networks can be modified and used to localize, track and accordingly crop the LV throughout the cine.