
1 Introduction

Image denoising aims to recover a latent clean image \(\mathbf {X}\) from its degraded observation \(\mathbf {Y}\), modeled as \(\mathbf {Y} = \mathbf {X} + \mathbf {V}\), where \(\mathbf {V}\sim \mathcal {N}(0,\sigma ^2)\) denotes Gaussian noise with zero mean and standard deviation \(\sigma \). It is an ill-posed inverse problem because the noise prevents the full recovery of image details. Consequently, prior information is used to regularize the problem: internal references in nonlocal self-similarity [1, 2], arbitrary priors in sparse representation [3, 4], or estimated priors in deep learning [5, 6]. In patch-based image denoising, each image is partitioned into a set of overlapping patches, and the degradation model is written for each patch i as \(\mathbf {y}_i= \mathbf {x}_i+ \mathbf {v}_i\). Without loss of generality, we can represent image patches in the vector space generated by a dictionary \(\mathbf {D}\in R^{m\times M}\), i.e. a set of M basis vectors (also called atoms) of size m. For each image patch \(\mathbf {y}_i\), the objective is to find a vector \(\varvec{\alpha }_i\in R^M\) such that the clean latent patch \(\mathbf {x}_i\) satisfies \(\mathbf {x}_i= \mathbf {D}\varvec{\alpha }_i\). Using maximum a posteriori (MAP) estimation, we have:

$$\begin{aligned} \hat{\varvec{\alpha }}_i = \mathop {\mathrm {arg\,max}}_{\varvec{\alpha }_i} \log (p(\varvec{\alpha }_i|\mathbf {D},\mathbf {y}_i)) \propto \mathop {\mathrm {arg\,min}}_{\varvec{\alpha }_i} \{||\mathbf {y}_i-\mathbf {D}\varvec{\alpha }_i||_2^2 - \lambda \log (p(\varvec{\alpha }_i))\} \end{aligned}$$
(1)

where \(\lambda = \tau \sigma ^2\) (with \(\tau > 0\)) controls the amount of regularization.
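For completeness, the following sketch shows how Eq. (1) follows from Bayes' rule under the Gaussian noise model above, dropping terms that do not depend on \(\varvec{\alpha }_i\):

$$\begin{aligned} \mathop {\mathrm {arg\,max}}_{\varvec{\alpha }_i} \log p(\varvec{\alpha }_i|\mathbf {D},\mathbf {y}_i) = \mathop {\mathrm {arg\,max}}_{\varvec{\alpha }_i} \left\{ \log p(\mathbf {y}_i|\mathbf {D},\varvec{\alpha }_i) + \log p(\varvec{\alpha }_i)\right\} = \mathop {\mathrm {arg\,min}}_{\varvec{\alpha }_i} \left\{ \frac{1}{2\sigma ^2}||\mathbf {y}_i-\mathbf {D}\varvec{\alpha }_i||_2^2 - \log p(\varvec{\alpha }_i)\right\} \end{aligned}$$

Multiplying the last objective by \(2\sigma ^2\) gives Eq. (1) with \(\lambda = 2\sigma ^2\); the more general choice \(\lambda = \tau \sigma ^2\) simply lets \(\tau \) tune the weight of the prior.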

We consider a Gaussian Mixture Model (GMM) to model the distribution of image patches, i.e. \(p(\varvec{\alpha }_i) = \sum _{k=1}^{K}\pi _k \mathcal {N}(\varvec{\alpha }_i| \varvec{\mu }_k,\varvec{\varSigma }_k)\), where \(\pi _k\) are the mixing weights and \(\varvec{\mu }_k\), \(\varvec{\varSigma }_k\) are the mean and covariance matrix of the \(k^{th}\) component. The model can be estimated from a set of image patches [7] or from collaborative groups of patches [8, 9] extracted from standard images. However, solving Eq. (1) with the complete mixture is time-consuming and, to our knowledge, existing methods use only one prominent component for each image patch, without justifying this choice. In this contribution, we divide the patches of an input image into a set \(P_1\) of simple patches with a prominent component and a set \(P_2\) of the remaining patches. We focus on the set \(P_2\) and conduct multiple experiments showing that only marginal gains can be obtained by considering the full GMM in denoising. We explore two choices of dictionary (identity matrix and K-SVD) and two choices of GMM complexity, report PSNR and reconstruction error, and discuss which types of images are difficult to reconstruct.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the datasets. The details of the Gaussian mixture model as Prior for Image Denoising (GPID) method are described in Sect. 3. Section 4 presents the experimental results and discussion, and Sect. 5 concludes.

2 Datasets

We explore the following datasets, with different image types and structures:

  • Cartoon [10] contains 590 images of popular cartoon characters. We choose 45 images to train a GMM and 80 images for evaluation.

  • Urban [11] contains urban scenes with high self-similarity and many repeated patterns. We use 25 images for training and 25 images for denoising.

  • Nature: we use the 200 training images from [12] and the 20 popular natural test images presented in [8].

  • Brodatz [13] contains 112 grayscale images of natural textures. We select 30 good-quality, content-rich images and split each of them into 4 non-overlapping sub-images; 90 sub-images are used for training the GMM and 30 for denoising.

  • Dtd [14] contains texture images in the wild, such as banded, braided, spiral, and grid patterns. We choose 55 images for training and 40 images for denoising.

  • CT of Thorax and CT of Lung We download 7 sequences of CT lung images and 12 sequences of CT thorax images from [15]. For the thorax images, 40 are used for training the GMM and 40 for testing; for the lung images, 40 are used for training and 60 for testing.

  • MRI Brain We download 16 sequences of MRI brain images from [16]. 80 images from 7 sequences are selected for training and 60 images from the remaining sequences are used for denoising.

Fig. 1. Examples from the 8 datasets: test image with \(P_2\) patches in red (left), PSNR as a function of L (right). Captions indicate the image name, the percentage of \(P_1\) patches and the average reconstruction error \(||\hat{\mathbf {X}}_{L=1}-\hat{\mathbf {X}}_{L=5}||_{L_1}\). (Color figure online)

3 Image Denoising with a Gaussian Mixture Model

3.1 Training the GMM on a Patch Database

From the training set of good-quality, noise-free images, we randomly extract N patches \(\mathbf {x}_j\) (\(1 \le j \le N\)) of size \(\sqrt{m} \times \sqrt{m}\). After mean subtraction, each patch \(\mathbf {x}_j\) is encoded in the vector space spanned by the M atoms of the dictionary \(\mathbf {D}\), \(\mathbf {x}_j = \mathbf {D} \varvec{\alpha }_j\), by least-squares fitting: \(\varvec{\alpha }_j= (\mathbf {D}^T\mathbf {D})^{-1}\mathbf {D}^T \mathbf {x}_j\).
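The encoding step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function and variable names are ours, and for an overcomplete dictionary (more atoms than pixels) `np.linalg.lstsq` returns the minimum-norm least-squares solution, since \(\mathbf {D}^T\mathbf {D}\) is then not invertible.

```python
import numpy as np

def encode_patches(patches, D):
    """Least-squares encoding of patches in the dictionary D.

    patches : (N, m) array of vectorized sqrt(m) x sqrt(m) patches
    D       : (m, M) dictionary whose columns are the M atoms
    Returns an (N, M) array of coordinates alpha such that D @ alpha[j]
    approximates the mean-subtracted patch j.
    """
    # Subtract the mean (DC component) of each patch before encoding.
    X = patches - patches.mean(axis=1, keepdims=True)
    # alpha_j = (D^T D)^{-1} D^T x_j for all patches at once;
    # lstsq solves the same least-squares problem without forming D^T D explicitly.
    coeffs, *_ = np.linalg.lstsq(D, X.T, rcond=None)
    return coeffs.T
```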

The probability distribution of the patch coordinates \(\varvec{\alpha }_j\) can be modeled by a GMM with K components, \(p(\varvec{\alpha }_j)=\sum _{k=1}^{K}\pi _k\mathcal {N}(\varvec{\alpha }_j|\varvec{\mu }_k,\varvec{\varSigma }_k)\). The parameters \(\{\pi _k, \varvec{\mu }_k,\varvec{\varSigma }_k\}_{k=1}^K\) are obtained by maximizing the likelihood with the Expectation-Maximization (EM) algorithm. In the E-step, the posterior probability that \(\varvec{\alpha }_j\) belongs to the \(k^{th}\) component given the current GMM parameters (the "membership probability") is computed as in (2).

$$\begin{aligned} \gamma _{jk} = \frac{\pi _k\mathcal {N}(\varvec{\alpha }_j|\varvec{\mu }_k,\varvec{\varSigma }_k)}{\sum _{l=1}^{K} \pi _l \mathcal {N}(\varvec{\alpha }_j|\varvec{\mu }_l,\varvec{\varSigma }_l)} \end{aligned}$$
(2)

In the M-step, the parameters of Gaussian components are updated using (3)

$$\begin{aligned} \varvec{\mu }_k =\frac{\sum _{j=1}^{N} \gamma _{jk} \varvec{\alpha }_j}{\sum _{j=1}^{N} \gamma _{jk} } ; \quad \varvec{\varSigma }_k = \frac{\sum _{j=1}^{N} \gamma _{jk} (\varvec{\alpha }_j-\varvec{\mu }_k)(\varvec{\alpha }_j-\varvec{\mu }_k)^T }{\sum _{j=1}^{N} \gamma _{jk} }; \quad \pi _k = \frac{\sum _{j=1}^{N} \gamma _{jk} }{N} \end{aligned}$$
(3)
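One EM iteration over Eqs. (2) and (3) can be written compactly with NumPy and SciPy. This is an illustrative sketch under our own naming: a production implementation would evaluate the densities in the log domain and regularize the covariances to keep them positive definite.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(alpha, pi, mu, Sigma):
    """One EM iteration for a K-component GMM on coordinates alpha of shape (N, M)."""
    N, M = alpha.shape
    K = len(pi)
    # E-step: membership probabilities gamma_{jk}, Eq. (2).
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(alpha, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])                                        # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update means, covariances and mixing weights, Eq. (3).
    Nk = gamma.sum(axis=0)                    # effective number of patches per component
    mu_new = (gamma.T @ alpha) / Nk[:, None]
    Sigma_new = np.empty((K, M, M))
    for k in range(K):
        diff = alpha - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi_new = Nk / N
    return pi_new, mu_new, Sigma_new, gamma
```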

3.2 Denoising Algorithm

After mean subtraction, denoising a patch \(\mathbf {y}_i\) amounts to finding the optimal clean latent patch \(\hat{\mathbf {x}}_i = \mathbf {D}\hat{\varvec{\alpha }}_i\). Solving problem (1) with the full GMM prior \(p(\varvec{\alpha }_i)\) is very time-consuming. To overcome this issue, existing studies [7,8,9] assign the noisy patch \(\mathbf {y}_i\) to a single Gaussian component according to the posterior probability:

$$\begin{aligned} \gamma _{il} = \frac{\pi _l \mathcal {N}(\mathbf {y}_i| \mathbf {D}\varvec{\mu }_l, \mathbf {D}\varvec{\varSigma }_l\mathbf {D}^T)}{\sum _{n=1}^{K}\pi _n \mathcal {N}(\mathbf {y}_i| \mathbf {D}\varvec{\mu }_n, \mathbf {D}\varvec{\varSigma }_n\mathbf {D}^T)} \end{aligned}$$
(4)

where \(0 \le \gamma _{il} \le 1\) and \(\sum _{l=1}^{K} \gamma _{il} = 1\). For convenience, we assume the components are indexed so that \(\gamma _{i1} \ge \gamma _{i2} \ge \ldots \ge \gamma _{iK}\). Problem (1) has a closed-form solution when only the most probable component is retained:

$$\begin{aligned} \hat{\varvec{\alpha }}_i = \left( \mathbf {D}^T\mathbf {D} + \lambda \varvec{\varSigma }_1^{-1}\right) ^{-1} \left( \mathbf {D}^T \mathbf {y}_i+\lambda \varvec{\varSigma }_1^{-1}\varvec{\mu }_1 \right) \end{aligned}$$
(5)
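The component assignment (4) and the closed-form restoration (5) can be sketched as follows. The function names are ours, the responsibilities are computed in the log domain to avoid underflow, and the projected covariances are assumed to be non-degenerate (some formulations additionally add the noise covariance \(\sigma ^2\mathbf {I}\) in (4)).

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def assign_component(y, D, pi, mu, Sigma):
    """Responsibilities gamma_{il} of the K components for a noisy, mean-subtracted patch y (Eq. 4)."""
    log_dens = np.array([
        np.log(pi[l]) + multivariate_normal.logpdf(y, mean=D @ mu[l], cov=D @ Sigma[l] @ D.T)
        for l in range(len(pi))
    ])
    return np.exp(log_dens - logsumexp(log_dens))   # sums to 1; dominant component = argmax

def restore_p1_patch(y, D, mu1, Sigma1, lam):
    """Closed-form MAP restoration with the dominant component only (Eq. 5)."""
    Sinv = np.linalg.inv(Sigma1)
    A = D.T @ D + lam * Sinv               # (M, M) system matrix
    b = D.T @ y + lam * Sinv @ mu1
    alpha_hat = np.linalg.solve(A, b)      # solve the linear system rather than inverting A
    return D @ alpha_hat                   # denoised (mean-subtracted) patch
```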

Typically, this approach is acceptable when the first component is dominant and the other components do not contribute much to the optimization. In practice, we define the set of dominant patches \(P_1 = \{ \mathbf {y}_i\ \text {s.t.}\ \gamma _{i1}\ge 0.9\}\) and call \(P_2\) the set of the remaining patches. \(P_1\) patches are restored via (5), whereas the patches in \(P_2\) are restored by considering the L most probable components of the GMM. Consequently, in the following, we only solve the simplified problem (6). The denoising method is presented in Algorithm 1.

$$\begin{aligned} \varvec{\alpha }_i= \mathop {\mathrm {arg\,min}}_{\varvec{\alpha }_i} ||\mathbf {y}_i-\mathbf {D}\varvec{\alpha }_i||_2^2 -\lambda \log \left\{ \sum _{l=1}^{L}\pi _l\mathcal {N}(\varvec{\alpha }_i|\varvec{\mu }_l,\varvec{\varSigma }_l) \right\} \end{aligned}$$
(6)
Algorithm 1. The GPID denoising algorithm.
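Equation (6) has no closed-form solution for \(L > 1\), so the \(P_2\) patches are restored by gradient descent on its objective (step 7 of Algorithm 1). The sketch below spells out the gradient for the L retained components; it assumes the components have been re-indexed by decreasing responsibility for the patch at hand, the initialization, fixed step size and iteration count are illustrative assumptions, and a practical implementation would evaluate the mixture in the log domain.

```python
import numpy as np
from scipy.stats import multivariate_normal

def restore_p2_patch(y, D, pi, mu, Sigma, lam, L, step=1e-3, n_iter=1000):
    """Gradient descent on the objective of Eq. (6), restricted to the L most probable components."""
    # pi, mu, Sigma are assumed sorted so that the first L components are the most probable ones.
    Sinv = [np.linalg.inv(Sigma[l]) for l in range(L)]
    alpha = np.linalg.lstsq(D, y, rcond=None)[0]          # least-squares initialization (assumption)
    for _ in range(n_iter):
        # responsibilities of the L retained components at the current alpha
        dens = np.array([pi[l] * multivariate_normal.pdf(alpha, mean=mu[l], cov=Sigma[l])
                         for l in range(L)])
        w = dens / dens.sum()
        # gradient of ||y - D alpha||^2 - lam * log( sum_l pi_l N(alpha | mu_l, Sigma_l) )
        grad = -2 * D.T @ (y - D @ alpha)
        grad += lam * sum(w[l] * (Sinv[l] @ (alpha - mu[l])) for l in range(L))
        alpha = alpha - step * grad
    return D @ alpha
```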

3.3 Complexity Analysis

The GPID denoising method consists of two parts: off-line training and denoising. In the training phase, the overall complexity of learning the GMM of K components from N decomposition vectors \(\varvec{\alpha }_j\in R^M\) is \(O(KM^3N)\). In the denoising process, the \(P_1\) patches are restored via (5), which requires \(O(M^2P_1)\) operations, and the \(P_2\) patches (typically 10–20% of the P patches of an image) are recovered by gradient descent with complexity \(O(LT_{gd}M^3P_2)\), where \(T_{gd}\) is the number of gradient-descent iterations and, with a slight abuse of notation, \(P_1\) and \(P_2\) also denote the cardinalities of the two sets. The computation of the membership probabilities requires \(O(m^3PK)\) operations. Since the denoising step is repeated T times, the total complexity is \(O(m^3PKT + M^2P_1T + LT_{gd}M^3P_2T)\).
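To make these orders of magnitude concrete, a rough back-of-envelope count with the parameters of Sect. 4 is given below (operation counts, not timings). The number of overlapping patches of a \(256 \times 256\) image and the fraction of \(P_2\) patches are illustrative assumptions.

```python
# Rough order-of-magnitude operation counts for one image (illustration only).
m, M, K, T = 64, 256, 200, 5            # 8x8 patches (m = 64), K-SVD atoms, GMM components, outer iterations
L, T_gd = 5, 1000                       # retained components (example value) and max gradient-descent iterations
P = (256 - 8 + 1) ** 2                  # overlapping 8x8 patches of a 256x256 image (~62k) -- assumption
P2 = int(0.15 * P)                      # assume ~15% of the patches fall in P_2
P1 = P - P2

membership  = m**3 * P * K * T          # posterior computation,   O(m^3 P K T)
closed_form = M**2 * P1 * T             # P_1 patches via Eq. (5), O(M^2 P_1 T)
grad_desc   = L * T_gd * M**3 * P2 * T  # P_2 patches via Eq. (6), O(L T_gd M^3 P_2 T)
print(f"{membership:.1e}  {closed_form:.1e}  {grad_desc:.1e}")
```

Even with only 10–20% of the patches in \(P_2\), the gradient-descent term dominates, which is consistent with the running times reported in Sect. 4.1.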

4 Experimental Results

To show the effect of the restriction to the dominant component, we examine the performance of the GPID method on \(P_2\) patches with a varying number of components \(L \in \{1, 5, 10, 15, 20\}\) in step 7 of the optimization algorithm. We study the differences in peak signal-to-noise ratio (PSNR) and mean gray-level reconstruction error for the 8 datasets presented in Sect. 2, for the identity dictionary \(\mathbf {D}=\mathbf {I}\) and a K-SVD dictionary, and for small (\(K = 20\)) and large (\(K = 200\)) numbers of components.

In all experiments, we degrade the images from the database with white Gaussian noise of standard deviation \(\sigma = 30\). We train the two GMMs for each dataset on \(N=2\times 10^6\) randomly extracted patches of size \(8 \times 8\) (\(m = 64\)). In the GPID denoising method, we use \(T=5\) regularizing iterations with \(\eta = m^2/\sigma ^2\) and \(\beta = [1,4,8,16,32]/\sigma ^2\). We set \(\lambda = 0.9\sigma ^2\) and allow at most 1000 iterations in the gradient-descent optimization. All experiments are implemented in Matlab 2013a on a machine with a 3.5 GHz Intel Core i7-4770K CPU and 16 GB of RAM.

From the examples in Fig. 1, we notice that \(P_2\) patches are usually found close to edges and contours. We also report the PSNR obtained with the GPID method as a function of L. On these examples, only modest gains are obtained by considering several components in the reconstruction. These properties are explored further by computing the distributions of PSNR gains and reconstruction errors.

4.1 Denoising Performance

When using the identity matrix as a dictionary, image patches are denoised without transformation. Note that for \(L=1\) the GPID method coincides with the EPLL method proposed by Zoran and Weiss [7]. From Fig. 2, we first observe that most image patches correspond to a single dominant component. As expected, more complex images such as textures (Brodatz and Dtd datasets) require more components and contain more \(P_2\) patches.

We compute the PSNR on \(P_2\) patches for \(L \in \{1, 5, 10, 15, 20\}\) and study the distribution of the maximum improvement \(\max _L \mathrm {PSNR}_{L} - \mathrm {PSNR}_{L=1}\).
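For reference, the PSNR and this maximum-improvement statistic can be computed as below. This is a generic sketch on full 8-bit images; in the paper, the statistic is evaluated on the \(P_2\) patches only.

```python
import numpy as np

def psnr(x_hat, x, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a denoised image x_hat and the clean image x."""
    mse = np.mean((x_hat.astype(float) - x.astype(float)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def max_improvement(denoised_by_L, clean):
    """(max_L PSNR_L) - PSNR_{L=1}, given denoised images indexed by the number of components L."""
    psnrs = {L: psnr(img, clean) for L, img in denoised_by_L.items()}
    return max(psnrs.values()) - psnrs[1]
```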

As shown in Fig. 2, for five datasets (Cartoon, Urban, Nature, CT Thorax and MRI), the maximum improvement is negligible: less than 0.1 dB for all test images except one from Nature. For complex images such as textures (Brodatz and Dtd datasets) and the CT Lung images, some images can be modestly improved, up to 0.2 dB for \(K=20\). PSNR gains larger than 0.2 dB are only observed for complex images (Dtd and CT Lung) with \(K=200\), i.e. with enough components in the GMM to model the details of the distribution, and only for a small fraction of images (around 20%). These gains require around 200 s for a \(256 \times 256\) image, whereas only 10 s are needed with a single Gaussian component.

We also analyze in Fig. 2 the reconstruction error on the central pixel of the \(P_2\) patches, \(||\hat{\mathbf {X}}_{L=1}-\hat{\mathbf {X}}_{L=5}||_{L_1}\). For all images, this difference is less than 2 gray levels per pixel and is not visually perceptible.

Fig. 2. Denoising performance for the identity dictionary.

For each dataset, we learn an over-complete dictionary \(\mathbf {D}\) with 256 atoms as in [3]. Figure 3 shows a situation similar to Fig. 2: most patches in the test images belong to \(P_1\), and most images can be reconstructed with a single component at a cost of less than 0.1 dB. PSNR gains larger than 0.2 dB are only observed for a few complex images in the Dtd and Urban datasets. Gray-level differences are lower than with the identity dictionary, around 1.1 gray levels per pixel.

Fig. 3. Denoising performance for the K-SVD dictionary with 256 atoms.

4.2 Dictionary Choice and Model Complexity

Using the denoising results from the 8 datasets with \(L=1\), we compare the two dictionaries and the two GMM sizes in Fig. 4 (see also the examples in Fig. 1). We observe that increasing the GMM complexity is nearly always beneficial, with gains of up to 2 dB PSNR, and that the K-SVD dictionary tends to benefit more from \(K=200\). Overall, the K-SVD dictionary yields slightly better PSNR, especially with the larger GMM, but the results vary from image to image, which suggests that the best dictionary choice is largely image-specific.

Fig. 4. Effect of model complexity (left) and dictionary choice (right) for \(L=1\).

In this manuscript, we consider only two sizes for the dictionary \(\mathbf {D}\). When \(\mathbf {D}\) is the identity matrix, its size corresponds to the patch size m. With the K-SVD method, we obtain an overcomplete dictionary \(\mathbf {D}\) with 256 atoms. In both cases, the dictionary defines the basis in which image patches are represented and is sufficiently rich to represent all image features. Dictionaries of the same size would yield the same results, as they are related by a rotation of the basis. Our experiments suggest that complex GMM models can take advantage of the additional degrees of freedom of an overcomplete dictionary to model image details.

5 Conclusion

This paper studies the number of useful GMM components for patch-based image denoising on 8 image datasets. We first observe that most of the patches in an input image are well represented by a single prominent component. By exploring denoising with an increasing number of components \(L \in \{1,5,10,15,20\}\), we show that, on all datasets, using more than one component yields only modest gains in PSNR and in \(L_1\) reconstruction error (gray-level differences). This justifies current practice and drastically reduces the computational cost. Much larger improvements can be obtained with a suitable dictionary and GMM complexity, but the reconstruction itself only requires a single component.