
1 Introduction

Over recent decades, large-scale databases containing prominent imaging components in addition to omics, demographics and other metadata have increasingly underpinned national population imaging cohorts [1, 7, 10], biomedical research (cf. [10]), and pharmaceutical clinical trials (cf. [6]). Large imaging databases provide new opportunities for the comprehensive characterization of the population or disease phenotype, but they equally pose new challenges. The collection and processing of large databases require automated, streamlined image analysis pipelines that are robust to varying image quality. In contrast, most image analysis methods developed over the last three decades were designed for relatively small datasets where visual quality checking and pruning are still feasible, and have thereby not been exposed to heterogeneous image quality. At the same time, visual quality assurance in multi-centre large-scale cohorts with several imaging sequences per subject is simply infeasible. Despite efforts to enforce strict imaging protocols to ensure consistently high-quality images, various incidental artefacts are inevitable and have a significant impact on analysis techniques [8]. Detecting this small fraction of inadequate data requires an enormous amount of expert manual labour, or resorting to sampling schemes that are prone to missing inadequate data. Hence, automatic image quality control in population imaging studies is an emerging challenge for the medical image computing community.

In the multimedia community, image quality assessment (IQA) has been explored extensively over the last two decades [5]. However, existing algorithms are not directly applicable to biomedical images, mainly for three reasons. Firstly, multimedia IQA algorithms generally quantify image quality with respect to human visual perception rather than being task-oriented, and are therefore not appropriate for detecting medical image artefacts, as pointed out by Xie et al. [12]. For a description of such artefacts, see for instance Jacobson et al. [3] for dual energy X-ray absorptiometry (DXA). Secondly, in multimedia, only a limited number of artefacts, such as blurring, noise, and JPEG compression, are normally of interest. This is in stark contrast with the much more diverse and difficult-to-catalogue artefacts in medical imagery. Furthermore, most of the literature on imaging artefacts has focused on their origins in the physics of image acquisition [11] or on their qualitative description (cf. [3]), neither of which helps in developing automated techniques for their detection. Although there has been some work in the medical community on detecting specific artefact types automatically, such as eddy currents and head motion in diffusion MRI [8], or blurring and uneven illumination in dermoscopy [12], these methods are not general purpose and cannot detect unforeseen incidental artefacts in large-scale databases. Thirdly, artefacts in multimedia are usually easier to model mathematically, which makes it possible to synthetically generate annotations over arbitrarily large image sets, including artefacts of varying severity. In contrast, manual quality annotations in large-scale medical image databases are rare, and their creation would require extensive and tedious visual assessment, which is laborious and error-prone. A completely unsupervised framework is therefore desirable.

In this work, we propose a general-purpose and fully unsupervised approach to detect images with artefacts in large-scale medical image datasets. Our approach also localizes the artefact within the corresponding image. It relies on only three main assumptions. Firstly, artefacts are local in nature and can be observed in image patches of an appropriate size, although in the extreme case the extent of such patches could be the full image. Secondly, the image database is large enough to capture the statistics of the normal images and those of the artefacts of interest. Thirdly, the incidence of artefacts in the image database is small enough that artefacts always remain outliers in the statistical distribution of the database. Under these assumptions, we propose a novel image coverage technique based on a visual dictionary of image patches to represent each image with a set of visual words localized in the image domain. Artefact detection is based on the similarity between each representative visual word and the corresponding raw patch from the image. The proposed method will facilitate large-scale, high-throughput quality control and is applicable to different imaging modalities without the need for manual annotation.

2 Proposed Method

The basic idea is to build a model representing the normal characteristics of the image population and detect the artefacts as deviations from this model. Based on the assumptions above, the proposed method works on image patches and comprises three main constituents: a robust learning of a dictionary of patches, an optimal coverage of the images with the dictionary words, and an assessment of the similarity between each covered patch and the corresponding dictionary word. This assessment allows us to detect outliers, identifying both images with artefacts and their locations in the image.

The use of a visual dictionary for image quality assessment was proposed in [13] and later explored as the so-called bag of visual words (BOVW) representation. In the BOVW approach, each image is modelled as a distribution over the visual words by normalizing the histogram of occurrence counts of each visual word. In this study, however, we propose a novel image representation with a subset of visual words that provides an optimal image coverage, which is a new concept that differs from the typical usage of visual codebooks. The proposed optimal image coverage allows the visual words to be positioned optimally rather than at prefixed positions. This reduces the number of words required for a good representation, which maximizes the generalizability of the model. Artefact detection is based on the similarity between the visual words and the corresponding image patches (Fig. 1). An image patch with a small dissimilarity score can be interpreted as artefact-free only if the dictionary has already excluded all words with artefacts. Hence, robustness to outliers is crucial in the dictionary learning step, and common clustering algorithms such as k-means are therefore not appropriate. We propose a robust dictionary learning approach based on the fixed-width clustering algorithm, introduced for outlier detection in network systems [2].

Fig. 1. An example of optimal coverage and artefact detection as outliers in the dissimilarity score distribution.

2.1 Robust Dictionary Learning

The objective is to learn a dictionary \(\mathcal {W} = \lbrace w^1, \cdots , w^N\rbrace \) of N visual words from a large pool of patches, capturing the normal shape and appearance variations of the image class while excluding outlier patches. An outlier patch is expected to lie in a sparse region of the feature space (here, raw intensity values), having few or no neighbours within a typical distance r. Observe that outlier patches detected in this step cannot be used directly to identify image artefacts: since images are not coregistered and patches are extracted from fixed locations, some proportion of outliers will be due to misalignments and will not necessarily represent an image artefact.

The proposed robust dictionary learning is as follows. Each image is divided into overlapping square patches of size k for 2D images, or cubic patches for 3D images, with an overlap of size \(k^\prime \) between neighbouring patches. The fixed-width clustering algorithm is then applied as follows: all the patches are shuffled randomly, and the first patch becomes the first visual word. For each subsequent patch, the Euclidean distance to each visual word is computed. If a distance is less than r, the patch is added to the corresponding cluster and the centroid is recalculated as the average of all members; otherwise, the patch becomes the centroid of a new cluster. Observe that each patch may contribute to more than one cluster. Finally, clusters with only one member are considered outliers and removed from the dictionary.
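A minimal sketch of this clustering step in Python, assuming the patches have already been extracted and flattened into rows of a NumPy array (the function name and interface are illustrative, not from the original implementation):

```python
import numpy as np

def fixed_width_clustering(patches, r, seed=0):
    """Fixed-width clustering: return centroids (visual words) of width r."""
    rng = np.random.default_rng(seed)
    patches = patches[rng.permutation(len(patches))].astype(float)  # random shuffle
    centroids, sums, counts = [patches[0].copy()], [patches[0].copy()], [1]
    for p in patches[1:]:
        assigned = False
        for i in range(len(centroids)):
            if np.linalg.norm(p - centroids[i]) < r:  # within the fixed width r
                sums[i] += p
                counts[i] += 1
                centroids[i] = sums[i] / counts[i]    # recompute the centroid
                assigned = True                       # a patch may join several clusters
        if not assigned:                              # no cluster within r: seed a new one
            centroids.append(p.copy())
            sums.append(p.copy())
            counts.append(1)
    # Singleton clusters are treated as outlier patches and discarded.
    return np.array([c for c, n in zip(centroids, counts) if n > 1])
```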

2.2 Optimal Image Coverage and Word Selection

A coverage of an image I is a selection of visual words placed at different locations in the image so that all pixels are covered at least once. Let us consider that the image I has P pixels and each visual word can be deployed at L locations indexed by \(\ell \in [1,L]\), where \(L \le P\) depends on the resolution with which the image is swept. The binary mask \(m_\ell \) represents the word location \(\ell \) in the image, \(d_\ell ^n\) denotes the word \(w^n\) placed at location \(\ell \) with appropriate zero-padding, and the binary variable \(z_\ell ^n\) encodes whether \(d_\ell ^n\) is selected to cover the image or not. Thus, the binary vector \(\mathbf {z}= \left( z_1^1, \cdots , z_L^N \right) \) would represent a coverage of the image if

$$\begin{aligned} \sum _{n, \ell } z_\ell ^n m_\ell \ge \mathbf {1}_{P\times 1}, \end{aligned}$$
(1)

where the left-hand side is an integer vector counting how many times each pixel is covered in the image.

We define the image coverage error as the \(L_2\)-norm of the difference between each selected visual word and the corresponding image patch,

$$\begin{aligned} E=\sum _{n, \ell } z_\ell ^n \left\| d_\ell ^n - m_\ell \circ I \right\| ^2 . \end{aligned}$$
(2)

Here, \(m_\ell \circ I\) denotes the component-wise product between the binary mask \(m_\ell \) and the image I. The optimal image coverage is defined by the minimization of the coverage error, \(\mathbf {z}={{\mathrm{argmin}}}_\mathbf {z}E\), subject to the constraint in Eq. 1.

Let us denote by \(z_\ell ^* = \sum _n z_\ell ^n\) the number of visual words placed at location \(\ell \). If two words, \(w^{n_1}\) and \(w^{n_2}\), are used at the same location \(\ell \) (\(z_\ell ^{n_1}=z_\ell ^{n_2}=1\)), then the coverage error will always be larger than using just one of them, without any effect on the constraint. Hence, the optimal solution will place at each location \(\ell \) either a single visual word (\(z_\ell ^*=1\)) or none (\(z_\ell ^*=0\)). Therefore, the optimization can be split into two independent problems. First, for each \(\ell \), the locally optimal visual word \(w^{n(\ell )}\) is selected by minimizing the local error,

$$\begin{aligned} E_\ell = \min _n \Vert d_\ell ^n - m_\ell \circ I \Vert ^2 . \end{aligned}$$
(3)

Then, the optimal locations, \(\mathbf {z}^* = (z^*_1, \cdots , z^*_L)\), are selected by minimizing the total coverage error,

$$\begin{aligned} \mathbf {z}^* = \mathop {\text {argmin}}\limits _{\mathbf {z}^*} \sum _{\ell } z_\ell ^* E_\ell \qquad \text {subject to} \quad \sum _{\ell } z_\ell ^* m_\ell \ge \mathbf {1}_{P\times 1} . \end{aligned}$$
(4)

Equation 4 can be efficiently solved using linear integer programming packages, such as the Matlab Optimization Toolbox (The MathWorks Inc., Natick, MA).
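As a concrete illustration, the two-step optimization of Eqs. 3 and 4 can be sketched as follows for a 2D image. We use SciPy's mixed-integer solver (scipy.optimize.milp, available in SciPy 1.9+) as a stand-in for the Matlab toolbox; the candidate-location grid, stride, and function names are our own assumptions:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def optimal_coverage(image, words, stride):
    """Eq. 3: best word per location; Eq. 4: integer program over locations."""
    N, k, _ = words.shape                   # N visual words of size k x k
    H, W = image.shape
    ys = list(range(0, H - k + 1, stride))
    xs = list(range(0, W - k + 1, stride))
    if ys[-1] != H - k: ys.append(H - k)    # ensure the grid can cover every pixel
    if xs[-1] != W - k: xs.append(W - k)
    locs = [(y, x) for y in ys for x in xs]
    L, P = len(locs), H * W
    E = np.empty(L)                         # local errors E_l (Eq. 3)
    best = np.empty(L, dtype=int)           # locally optimal word index n(l)
    A = np.zeros((P, L))                    # coverage matrix: A[p, l] = m_l(p)
    for l, (y, x) in enumerate(locs):
        patch = image[y:y + k, x:x + k]
        errs = ((words - patch) ** 2).sum(axis=(1, 2))
        best[l], E[l] = errs.argmin(), errs.min()
        mask = np.zeros((H, W))
        mask[y:y + k, x:x + k] = 1
        A[:, l] = mask.ravel()
    # Minimize sum_l z_l E_l subject to A @ z >= 1 (each pixel covered at least once).
    res = milp(c=E, integrality=np.ones(L),
               constraints=LinearConstraint(A, lb=np.ones(P)),
               bounds=Bounds(0, 1))
    selected = np.flatnonzero(np.round(res.x))
    return [(locs[l], int(best[l])) for l in selected]
```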

2.3 Artefact Detection

For a given image, a dissimilarity score is computed between each representative visual word and its corresponding raw patch. Any image patch with a score above an optimal threshold identifies the location of an artefact in the image. Observe that since word matching is local and the best-fitting locations are found through the optimal coverage, without imposing a priori known locations, images do not need to be registered beforehand.

Dissimilarity score: The local properties of an image can be described by the set of its derivatives, known as the local jet [9]. For a given image I and a scale \(\sigma \), the local jet of order N at point \(\mathbf {x}\) is defined as

$$\begin{aligned} J^N[I](\mathbf {x}, \sigma ) \triangleq \lbrace L_{i_1, \cdots , i_n}(\mathbf {x}; \sigma ) \rbrace _{n=0}^N, \end{aligned}$$
(5)

where the \(n^{th}\)-order derivative tensors are computed by the convolution

$$\begin{aligned} L_{i_1, i_2, \cdots , i_n}(\mathbf {x}; \sigma ) = \left[ G^{(\sigma )}_{i_1, i_2, \cdots , i_n} * I \right] \!(\mathbf {x}), \end{aligned}$$
(6)

with the corresponding derivatives of the Gaussian kernel \(G^{(\sigma )}_{i_1, i_2, \cdots , i_n}(\mathbf {x})\), and \(i_k = 1,\ldots , D\), for D-dimensional images. For each derivative order, a complete set of independent invariants under orthogonal transformations can be extracted [9]. For 2D images and second order, for example, this complete set is formed by the intensity, the gradient magnitude \(\sqrt{\sum _i L_i^2}\), the Laplacian \(\sum _i L_{ii}\), the Hessian norm \(\sum _{i,j} L_{ij} L_{ji}\), and the second derivative along the gradient \(\sum _{i,j} L_i L_{ij} L_j\). A multiresolution description can be obtained by varying the scale \(\sigma \) over a convenient set of scale-space representations. For each invariant feature, the Euclidean distance between the visual word and the corresponding image patch is used as the dissimilarity metric.
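As an illustration, these second-order invariants can be computed with Gaussian derivative filters; the following sketch uses scipy.ndimage for a 2D image (the function name is ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def jet_invariants_2d(I, sigma):
    """Complete set of 2D invariants up to second order from the local jet."""
    d = lambda oy, ox: gaussian_filter(I.astype(float), sigma, order=(oy, ox))
    L0 = d(0, 0)                                       # smoothed intensity
    Ly, Lx = d(1, 0), d(0, 1)                          # first-order derivatives
    Lyy, Lxx, Lxy = d(2, 0), d(0, 2), d(1, 1)          # second-order derivatives
    grad_mag = np.sqrt(Lx**2 + Ly**2)                  # sqrt(sum_i L_i^2)
    laplacian = Lxx + Lyy                              # sum_i L_ii
    hess_norm = Lxx**2 + 2 * Lxy**2 + Lyy**2           # sum_{i,j} L_ij L_ji
    sec_grad = Lx*Lx*Lxx + 2*Lx*Ly*Lxy + Ly*Ly*Lyy     # sum_{i,j} L_i L_ij L_j
    return L0, grad_mag, laplacian, hess_norm, sec_grad
```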

Optimum threshold: The optimum threshold for each dissimilarity score is computed as follows. For each image in the database, the maximum score among all its representative visual words is computed. The optimum threshold is then set to \(q_3 + \nu (q_3 - q_1)\), with \(\nu = 1.5\), where \(q_1\) and \(q_3\) are the first and third quartiles of these maxima, respectively. An image is considered artefact-free only if all its representative visual words have dissimilarity scores below the optimum threshold for all the considered features.
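This is the standard interquartile upper-fence rule; a minimal sketch, assuming scores holds the per-image maxima described above:

```python
import numpy as np

def optimum_threshold(scores, nu=1.5):
    """Upper fence q3 + nu * (q3 - q1) over per-image maximum scores."""
    q1, q3 = np.percentile(scores, [25, 75])
    return q3 + nu * (q3 - q1)
```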

3 Experiments and Results

Data specifications: We tested the proposed algorithm on a dataset of hip DXA scans collected with a Hologic QDR4500 Acclaim densitometer (Hologic Inc., Bedford, MA, USA) during a previous pharmaceutical clinical trial [6]. DXA is the standard imaging technique for bone densitometry. Current software packages for DXA analysis require manual interaction, so each scan must be analysed individually; quality control is therefore assumed to be performed, almost accurately, online during the analysis step. Contrary to this common perception, a survey of members of the International Society for Clinical Densitometry (ISCD) indicated that errors in DXA acquisition are not rare [4]. Although qualitative descriptions of DXA artefacts are available [3], no quantitative classification or manually-crafted features have been introduced so far. Thus, no prior knowledge of the types of artefacts can be assumed for the DXA modality, which makes automatic DXA quality control an interesting challenge. A population cohort of 5592 women was included in the clinical trial [6], of which 1300 scans were available for this study.

Experimental set-up: To evaluate the method, 300 scans were randomly selected for manual annotation, and the visual dictionary and the set of optimum thresholds were learned on the remaining 1000 scans. Since the learning step is also unsupervised, we could have trained on the whole dataset and validated on a subset of it; however, for a more rigorous evaluation, we split the dataset into disjoint training and test sets.

Performance measures: Sensitivity and specificity of the method are reported on the test data against the manual annotation. Sensitivity is defined as the proportion of images with artefacts that are correctly detected by the algorithm; specificity is the proportion of normal images that are correctly labelled as artefact-free. The localization capacity of the algorithm is measured as the number of patches that correctly localize an artefact, divided by the total number of detected patches in the images correctly identified as containing artefacts; we refer to this as the localization rate.
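For clarity, a minimal sketch of the first two measures, assuming boolean arrays over the test images (manual labels and algorithm detections); names are illustrative:

```python
import numpy as np

def sensitivity_specificity(has_artefact, detected):
    tp = np.sum(has_artefact & detected)     # artefact images correctly flagged
    tn = np.sum(~has_artefact & ~detected)   # normal images correctly passed
    return tp / np.sum(has_artefact), tn / np.sum(~has_artefact)
```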

Fig. 2. Samples from the dataset. Red squares show the location of detected artefacts.

Parameter selection: The patch size and the radius r are the two parameters of the proposed method; both are data dependent. The radius r is selected automatically by estimating a typical small distance between patches within the same image: for each image, all pairwise distances between its patches are computed, and the \(\frac{1}{n}\)-quantile of these distances is taken, where n is the total number of patches extracted from the image. The parameter r is then selected as the median of these quantiles over the image dataset. The patch size can be estimated from the size of the effect to be measured; for example, in femoral DXA scans, the diameter of the femoral stem is approximately 64 pixels. We tested patches of size 32 and 64 with an 8-pixel overlap and observed no difference in sensitivity, so we present the results with patches of size 64. In total, 24,830 patches were extracted from the 1000 training images. The radius was estimated as r = 3.5 for this dataset, and the resulting dictionary contained 1146 visual words.
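A sketch of this radius-selection heuristic, assuming per_image_patches is a list with one array of flattened patches per image (names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

def estimate_radius(per_image_patches):
    quantiles = []
    for patches in per_image_patches:
        n = len(patches)                              # patches in this image
        dists = pdist(patches)                        # all pairwise Euclidean distances
        quantiles.append(np.quantile(dists, 1.0 / n))  # (1/n)-quantile per image
    return float(np.median(quantiles))                # median over the whole dataset
```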

We tested invariant features up to second order; however, the second-order features did not provide any additional information. Hence, only intensity and gradient magnitude were finally used as features. The gradient magnitude of each image patch or visual word is normalized to unit Euclidean norm. Single-scale analysis with \(\sigma = 0.2\) was used. The optimum thresholds were derived as 0.37 and 4.86 for the gradient magnitude and intensity, respectively.

Results: Eleven of the 300 test images were manually annotated as containing artefacts, of which nine were detected by the proposed algorithm. Sensitivity and specificity are 81.82% and 94.12%, respectively, and the localization rate is 96%. Figure 2 shows examples of normal images and detected artefacts. Only 2 of the 11 artefact images were misclassified as normal; both contain movement artefacts that cause a subtle vertical displacement in the image. The algorithm successfully localized other types of artefacts, including the presence of an external object (the key-shaped object in Fig. 2).

4 Conclusion

In this paper, we proposed a completely unsupervised and generic framework for automatic quality control in large cohorts of medical images. Based on the assumption that artefacts constitute a small proportion of the dataset, a dictionary-based framework built on an optimal coverage of images was introduced to detect and localize image artefacts as outliers to the normal image class. The method's computational complexity is linear in the number of input images, providing good scalability to large datasets. We tested the method on 1300 femoral DXA scans and reported good sensitivity and specificity on this dataset.