1 Introduction

Recent efforts [1, 2, 24] to reconstruct the neural connections of animal brain tissue call for effective methods to collect large numbers of microscopic images of tissue slices. An efficient way [3] to collect such images is to first cut the biological sample into thousands of ultra-thin slices, each about 30 nm to 70 nm thick, then place these slices on several silicon wafers, and image them in parallel with several scanning electron microscopes (SEMs). Our laboratory collects microscopic images in this way too. This method does accelerate data collection, but the speed is still not as fast as we expected. For example, in a typical high resolution image of an ultra-thin slice, each pixel corresponds to a physical size of 2 nm, and it takes 2 \({\upmu }\)s to collect each pixel. A single tissue slice of size \(1\,\mathrm{mm}^2\) therefore needs \(2.5 \times 10^{11}\) pixels, which takes a single SEM at least 138.8 h to collect. This means that imaging ten thousand ultra-thin slices requires one SEM to work continuously for at least 6.6 years! Even 6 SEMs working in parallel still need more than a year. Undoubtedly, further accelerating the collection process is of great significance, and because microscopes are expensive, it is not feasible to raise the collection speed only by increasing the number of microscopes. This work explores how to speed up collection on every single microscope by combining image super-resolution and image denoising.

Single image super resolution (SISR) tries to increase the image resolution and to recover as much detailed information as possible. A basic assumption of image super resolution is that high resolution images contain much redundant information and can therefore be recovered from low resolution images. According to [4], the linear degradation model of SISR is formulated as:

$$\begin{aligned} \mathbf {z} = D_sH\mathbf {x} \end{aligned}$$
(1)

where \(\mathbf {z} \in \mathbf {R}^{M \times N} \) is the input low resolution (LR) image, \(\mathbf {x} \in \mathbf {R}^{Ms \times Ns}\) is the unknown high resolution (HR) image, the linear operation \(H \in \mathbf {R}^{MNs^2 \times MNs^2}\) blurs the image, and the operation \(D_s \in \mathbf {R}^{MN \times MNs^2}\) decreases the image scale by a factor s. Many recent works try to learn a mapping from low resolution images to high resolution images with an external data set, and the state-of-the-art mapping functions are usually parametrized with deep neural networks.
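As a rough illustration of Eq. (1), the sketch below applies one possible choice of \(H\) and \(D_s\) to a small array. The box blur, the function name `degrade`, and the random test image are assumptions made for this example only; the degradation model itself does not prescribe a particular blur.

```python
import numpy as np

def degrade(x, s):
    """Sketch of z = D_s H x, assuming H is an s-by-s box blur."""
    rows, cols = x.shape
    if rows % s or cols % s:
        raise ValueError("image sides must be divisible by the scale factor s")
    # Averaging each s-by-s block equals box-blurring the image and then
    # keeping one sample per block, i.e. blur (H) followed by decimation (D_s).
    return x.reshape(rows // s, s, cols // s, s).mean(axis=(1, 3))

rng = np.random.default_rng(0)
hr = rng.random((8, 8))   # stand-in HR image x of size Ms-by-Ns
lr = degrade(hr, 2)       # LR image z of size M-by-N
print(hr.shape, "->", lr.shape)
```

Because both steps are linear, the whole operation could also be written as a single matrix acting on the flattened image, which is the form used in Eq. (1).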

Image super-resolution and image denoising can be used to accelerate the collection of SEM images. Automatic analysis of biological tissue images requires that each organelle be covered by enough pixels; however, as discussed above, collecting high resolution SEM images is very time-consuming. In practice, an effective way to accelerate collection is to take images at a relatively lower resolution and then increase the image resolution with a super-resolution algorithm. Doubling the pixel size, that is, halving the resolution along each axis, reduces the number of pixels by a factor of four and therefore makes imaging four times faster. To balance collection speed against final image quality, this work only considers upscaling the collected images by a factor of 2.
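The timing argument can be checked with a short calculation. The sketch below uses the figures quoted in the Introduction (a \(1\,\mathrm{mm}^2\) slice, 2 nm pixels, 2 \({\upmu }\)s per pixel) and assumes that scan time is simply the pixel count times the dwell time, ignoring stage movement and other overhead. The function name `scan_hours` is hypothetical.

```python
def scan_hours(area_mm2, pixel_nm, dwell_us):
    """Hours one SEM needs to scan a square region pixel by pixel."""
    side_nm = (area_mm2 ** 0.5) * 1e6        # 1 mm = 1e6 nm
    pixels = (side_nm / pixel_nm) ** 2       # pixel count of the square region
    return pixels * dwell_us * 1e-6 / 3600.0 # microseconds -> seconds -> hours

base = scan_hours(1.0, 2.0, 2.0)     # 2 nm pixels
coarse = scan_hours(1.0, 4.0, 2.0)   # pixel size doubled to 4 nm
print(f"{base:.1f} h at 2 nm, {coarse:.1f} h at 4 nm, ratio {base / coarse:.2f}")
```

Under these assumptions the per-slice time at 2 nm comes out at roughly 139 h, and doubling the pixel size divides it by four.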

The most difficult problem in our application is obtaining a training dataset. Traditional image super-resolution does not consider the disturbance of random noise, whereas SEM images are notoriously noisy, which means we cannot simply use a pretrained image super-resolution model. Besides, traditional SISR algorithms usually down-sample HR images to create training samples, but SEM images collected with different parameters have different noise levels, which means we cannot create the training dataset in the same way. A possible solution is to image the same place at different resolutions to create the training samples. However, as observed in our experiments, the electron beam heats the tissue slices during collection and causes some physical deformation of the biological sample. As a result, even when collected at the same place with the same parameters, two images differ greatly at the pixel level. This change does not affect the biological meaning, but it seriously obstructs the training of an image super-resolution model.

In the SEM-image super resolution problem, the noise usually comes from irregularly scattered electrons, so it is reasonable to treat the noise as additive to the true image. We can then formulate the degradation model as:

$$\begin{aligned} \mathbf {z} = D_sH(\mathbf {x} + N) \end{aligned}$$
(2)

where \( \mathbf {z} \in \mathbf {R}^{M \times N} \) is the collected low resolution (LR) image and \(\mathbf {x} \in \mathbf {R}^{Ms \times Ns}\) is the unknown high resolution (HR) image, which we want to restore from the collected LR image \(\mathbf {z}\). The linear operation \(H \in \mathbf {R}^{MNs^2 \times MNs^2}\) and the operation \(D_s \in \mathbf {R}^{MN \times MNs^2}\) represent the degradation process as in Eq. (1). \(N \in \mathbf {R}^{Ms \times Ns}\) is the additive noise, which has the same size as \(\mathbf {x}\). Because \(D_s\) and \(H\) are linear, this formulation can be further written as:

$$\begin{aligned} \mathbf {z} = D_sH\mathbf {x} + G \end{aligned}$$
(3)

in which \(G = D_sHN\). Following this formulation, we can add noise to the down-sampled HR images to create our training image pairs. This bridges the gap between noisy and clean images, and at the same time the distortion between low resolution and high resolution images is no longer a problem, because the LR image of each pair is derived from its HR image. In this work, we train a generative adversarial network (GAN) [6] to fit the noise of low resolution SEM images, and then use the generator to add the learned noise to down-sampled high resolution image patches. This method helps us build a suitable dataset for training and validation.
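The step from Eq. (2) to Eq. (3) relies only on the linearity of \(D_sH\), which can be checked numerically. In the sketch below the box-blur `degrade` function, the Gaussian noise, and the array sizes are illustrative assumptions and are not the operators or the noise of a real SEM.

```python
import numpy as np

def degrade(x, s):
    """Stand-in for D_s H: s-by-s block averaging (box blur + decimation)."""
    rows, cols = x.shape
    return x.reshape(rows // s, s, cols // s, s).mean(axis=(1, 3))

rng = np.random.default_rng(1)
x = rng.random((8, 8))                      # stand-in clean HR image
noise = 0.1 * rng.standard_normal((8, 8))   # stand-in HR-scale noise N

lhs = degrade(x + noise, 2)   # D_s H (x + N), as in Eq. (2)
g = degrade(noise, 2)         # G = D_s H N, noise at the LR scale
rhs = degrade(x, 2) + g       # D_s H x + G, as in Eq. (3)
print(np.allclose(lhs, rhs))
```

The practical consequence is that noise can be added after down-sampling, directly at the LR scale, which is what the training-pair construction relies on.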

Overall, the contributions of this work are mainly in three aspects:

  1. A pipeline is proposed to accelerate SEM-image collection for large-scale biological-tissue imaging.

  2. A GAN is designed to produce high quality training samples for the combined super-resolution and denoising problem of SEM images.

  3. An end-to-end fully convolutional network is trained to solve the SEM-image super-resolution problem.

2 Related Work

2.1 Example-Based Image Super Resolution

Example-based image super resolution methods learn a mapping from LR image patches to HR image patches. Traditionally, as summarized in [9], example-based SISR methods employ hand-designed features, hand-designed upscaling, and hand-designed nonlinear mappings to reconstruct the HR image from the input LR image. SRCNN [10] merged feature extraction, nonlinear mapping, and reconstruction into a single model, training a simple three-layer fully convolutional network to act as feature extractor, nonlinear mapping function, and HR image reconstructor at the same time. That approach first upscales the LR image to the desired scale and then restores detailed information with the three-layer model. Later works such as VDSR [11] build deeper convolutional networks to learn a better mapping. Recent works merge the upscaling step into the super resolution model and train an end-to-end model to solve the SISR problem. EEDS [9] employs a transposed convolutional layer to increase the image scale, while ESPCN [12] uses a sub-pixel convolutional layer to bridge the gap between different scales. In a further development, SRGAN [13] solves the SISR problem with a generative adversarial network, using a discriminator to push the generator to learn the natural image manifold, which produces more vivid HR images.

2.2 Image Denoising

Image denoising is a low-level image processing problem that plays a fundamental role in many computer vision applications. BM3D [14] exploits nonlocal image modeling through 3D grouping, followed by collaborative filtering via transform-domain shrinkage of the 3D array. Later, Burger et al. [5] employed a simple multi-layer perceptron to learn the mapping from noisy images to noise-free images. Recently, Chang et al. [15] found that, given enough data, one single network can handle image denoising, image super resolution, and many other tasks separately, which illustrates that data-driven methods can solve many image recovery problems well.

2.3 Generative Adversarial Networks

Generative Adversarial Networks (GANs) [6] are a technique for training generative models that approximate an image distribution via a two-player adversarial game, and they have been shown to generate high quality images [7, 16, 17, 18]. In this framework, the generator (G) tries to generate realistic-looking images to deceive the discriminator (D), while D learns to discriminate between real images and generated images. In effect, D plays the role of a loss function and tells G how to generate more lifelike pictures. The training process of a GAN can be formulated as follows:

$$\begin{aligned} \mathop {\min }\limits _{G}\mathop {\max }\limits _{D}V(D, G) = \mathbf {E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbf {E}_{z \sim p_z(z)}[\log (1 - D(G(z)))] \end{aligned}$$
(4)

where \(p_{data}(x)\) is the distribution of natural images, and z is random noise. Typically, a generator takes a fixed-dimension random noise vector as input, which is then projected to a realistic-looking image.
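To make Eq. (4) concrete, the sketch below estimates \(V(D, G)\) from a batch of discriminator outputs by replacing the two expectations with sample means. The function name `gan_value`, the clipping constant, and the example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) in Eq. (4).

    d_real: discriminator outputs D(x) on real samples, in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1)
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

undecided = gan_value([0.5, 0.5], [0.5, 0.5])   # D cannot tell real from fake
confident = gan_value([0.9, 0.8], [0.1, 0.2])   # D separates the two sets well
print(undecided, confident)
```

The discriminator pushes this value up and the generator pushes it down; when D outputs 0.5 everywhere the value equals \(-2\log 2\).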

One of the most important properties of GANs is that this deep generative model does not need an explicit formula for the distribution it fits. Thus, it is possible to use a GAN to estimate the noise distribution of SEM images, and then reproduce realistic training samples for our model.

3 Sample Simulation and SEM-Image Super Resolution

3.1 Create High Quality Training Samples with GAN

We designed a generative adversarial network (GAN) to estimate the noise distribution of SEM images; the generator is then used to add the learned noise to down-sampled HR image patches and generate LR patches. We refer to this model as Noise-GAN in the later discussion. It would be possible to simply add random noise to the down-sampled HR image patches to produce the fake LR image patches, but it is hard to estimate the noise distribution precisely. Thus, this work explores fitting the noise distribution with a GAN.

In practice, we first down-sample the HR image patches and then feed them into the generator, which adds the generated noise and makes these patches look real. During training, the discriminator tries to distinguish the true LR patches from the fake LR patches, while the generator tries to fool the discriminator. After 30 epochs of training, the generator is good enough to produce realistic-looking LR patches, and even a human observer cannot tell the true LR patches from the fake ones.

The structure of Noise-GAN follows the principles of DCGAN [7] and W-GAN [8]. The generator projects a 128-d random noise vector to a noise map of \(64 \times 64\), which is then added to the input image patch to generate the fake LR patch. The generated fake LR patches and the real collected LR patches are then used to train the discriminator. One detail worth noting is that, in every training batch, the fake LR image patches and the true LR image patches come from the same place, so they carry the same semantic information. In addition, the mean value of each patch is subtracted before it is fed into the discriminator, to avoid the influence of luminance. With these measures, the only difference between the fake patches and the true patches is the noise level, which prevents the discriminator from classifying the patches using other information and ensures that the GAN learns to fit only the noise. The tanh activation rather than the ReLU activation is used in the generator, and we use the RMSProp optimizer with a learning rate of 0.00001 to train this model. The network is illustrated in Fig. 1.
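The data flow around the discriminator can be sketched without the networks themselves. In the example below the generator output is replaced by a random stand-in noise map, the "real" LR patches are random arrays, and the names `make_fake_lr` and `remove_mean` are hypothetical; only the additive combination and the per-patch mean subtraction follow the description above.

```python
import numpy as np

def make_fake_lr(down_hr, noise_map):
    """Fake LR patch = down-sampled HR patch + generated noise map."""
    return down_hr + noise_map

def remove_mean(patches):
    """Subtract each patch's own mean so luminance cannot be used as a cue."""
    return patches - patches.mean(axis=(1, 2), keepdims=True)

rng = np.random.default_rng(2)
down_hr = rng.random((4, 64, 64))                            # down-sampled HR patches
noise_map = 0.1 * np.tanh(rng.standard_normal((4, 64, 64)))  # stand-in for generator output
real_lr = rng.random((4, 64, 64)) + 0.3                      # stand-in for collected LR patches

fake_lr = make_fake_lr(down_hr, noise_map)
# Discriminator input: zero-mean fake patches followed by zero-mean real patches.
d_input = np.concatenate([remove_mean(fake_lr), remove_mean(real_lr)])
print(d_input.shape)
```

After the subtraction, the brightness offset of 0.3 in the stand-in real patches no longer separates the two groups, which is the purpose of the step.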

Fig. 1. The Noise-GAN structure: (a) the generator, which adds generated noise to the HR image patches; (b) the discriminator, which is trained to distinguish the true LR image patches from the generated LR patches.

In Fig. 2, we compare the generated LR patches with true LR patches and the down-sampled HR patches. These patches were cropped from a large image in our SEM-image data set, which is introduced in Sect. 4.

Fig. 2. Comparison between generated and real patches: (a) real HR image patch, which has a pixel size of 1 nm and was down-sampled to the same scale as (c); (b) generated fake LR patch, which uses (a) as input; (c) real LR patch, which was collected with a pixel size of 2 nm.

3.2 End-to-End SEM-Image Super Resolution

We designed a deep network to verify the final result. This network may not work as well as other state-of-the-art image super resolution methods, but it is simple to build, easy to train, and has enough capacity to learn the mapping from noisy LR patches to HR patches. We call this network SESR, which stands for SEM-image super resolution. This network reaches an average PSNR of 36.038 dB on Set5 after 13000 iterations of training on the 91-image dataset.

The architecture of our network, shown in Fig. 3(c), was inspired by the Inception models [19,20,21], ResNet [22], and EEDS [9]. The model has structures that capture multi-scale information and uses skip connections to transfer low-level information to the upper layers. Block1, illustrated in Fig. 3(a), has four sub-branches, each with a different receptive field ranging from 1 to 7, to capture features at different scales. The extracted features are then concatenated and processed by a convolutional layer. Block2 works in a similar way, but its large convolution kernels are decomposed into small asymmetric convolutions for computational efficiency. Block1 and Block2 can be seen as ensemble models, in which the features extracted at different scales are assembled for further prediction. A \(5 \times 5\) transposed convolutional layer is used to increase the image scale. Besides, as discussed in [21, 22], the skip connections used in this model also help the network converge faster. The res-block in this model is the same as in ResNet [22].
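The efficiency gain from asymmetric convolutions can be seen on a single-channel toy case. The sketch below assumes a \(1 \times 7\) kernel followed by a \(7 \times 1\) kernel, with random weights and a hand-written `correlate_valid` helper; the actual layer configuration of Block2 is not reproduced here. The two-step result matches a full \(7 \times 7\) kernel exactly only when that kernel is the outer product of the two small ones, so in a trained network the decomposition is a cheaper parametrization rather than an identity.

```python
import numpy as np

def correlate_valid(img, kernel):
    """Plain 'valid' 2-D cross-correlation, the operation a conv layer applies."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * img[i:i + out_h, j:j + out_w]
    return out

rng = np.random.default_rng(3)
img = rng.random((16, 16))
row = rng.standard_normal((1, 7))   # 1x7 asymmetric kernel
col = rng.standard_normal((7, 1))   # 7x1 asymmetric kernel
full = col @ row                    # equivalent rank-1 7x7 kernel

two_step = correlate_valid(correlate_valid(img, row), col)
one_step = correlate_valid(img, full)
print(np.allclose(two_step, one_step), full.size, row.size + col.size)
```

Both paths cover the same \(7 \times 7\) receptive field, but the asymmetric pair stores 14 weights per channel pair instead of 49.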

During training, we use the mean squared error as the objective function and the Adam [23] optimizer with a learning rate of 0.0003. Ten thousand iterations are enough to train this network to convergence. The network structure is illustrated in Fig. 3.

Fig. 3. The structure of SESR. The four sub-branches in (a) and (b) capture features at different scales, and the skip connections in these blocks pass low-frequency information to the upper layers in forward propagation and speed up training in back propagation.

Compared with two-stage methods such as SRCNN and VDSR, this model processes images end to end, and compared with other one-stage methods such as EEDS and ESPCN, it is lightweight: the parameters take less than 1 MB of storage.

4 Experiments and Results

The SESR model was trained with three groups of training data in this experiment to verify the effect of the generated training set. To the best of our knowledge, no previous work has dealt with the SEM-image super resolution problem, so we compare our results with two other possible methods.

4.1 SEM-Image Dataset

The SEM images used in this work were collected from the Zeiss Super55 electron microscope of our laboratory. We imaged the same place with pixel sizes ranging from 1 to 8 nm, corresponding to image sizes ranging from \(8192\,\times \,8192\) to \(1024\,\times \,1024\). Because these images are large, 8 images of each size are enough for training, validation, and testing. In the experiment, we crop 3 images for training, 1 image for validation, and 4 images for testing.

4.2 Training and Testing

The SESR model was trained with three groups of training data to illustrate the importance of the generated training data in our application. The first group is the 91-image dataset mentioned before; we denote this dataset as “-NT”. The second dataset is the real collected SEM dataset, denoted “-SEM”. The images with a pixel size of 1 nm were down-sampled to the corresponding scale to act as the ground truth, and the ground truth images were then down-sampled again to generate the LR image patches used as input. The third dataset is also the SEM dataset, but it uses the generated LR image patches as input. We use “-GAN” to denote this dataset.

We test these models with the real collected LR patches as input. It is important to note that, because of the distortion and deformation in the collection process, the output must be registered to the ground truth before it can be measured quantitatively. To avoid differences introduced by the registration algorithm, for every input image the outputs of all methods are registered to the ground truth with the same transformation. The peak signal-to-noise ratio (PSNR) is then used to measure the final result.
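For reference, PSNR between a registered output and its ground truth can be computed as in the sketch below. It assumes 8-bit intensity scaling (a peak value of 255) and two already aligned arrays of equal size; the registration step is not shown, and the function name `psnr` and the test arrays are illustrative.

```python
import numpy as np

def psnr(reference, estimate, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two aligned images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)   # mean squared error
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((16, 16))
est = np.full((16, 16), 25.5)   # constant error of 10% of the peak value
print(psnr(ref, est))
```

A constant error of one tenth of the peak value gives a mean squared error of one hundredth of the squared peak, so this toy case evaluates to about 20 dB.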

Table 1. Average results on the test set. Because the results are measured after registration, the final PSNR values are a little lower than in many other applications.

We compare the final result with two other possible methods. One first removes noise with BM3D and then enlarges the image with bicubic interpolation; the other first upscales the image with bicubic interpolation and then removes the noise with BM3D. Because we do not know the noise pattern precisely, we set \(\sigma = 25\) for the BM3D algorithm. These two methods give different results on our dataset. The average PSNR of all results is reported in Table 1.

Fig. 4. Final results and comparison with other methods.

A group of results is illustrated in Fig. 4. From Fig. 4(f), (i) and (k), we can see that methods trained with natural images are disturbed by the noise. From Fig. 4(d) we can see that using BM3D to remove the noise after bicubic upscaling destroys some detailed structure. From Fig. 4(g) and (h) we can see that denoising with BM3D first suppresses the noise well but blurs the final result. The similar appearance of Fig. 4(d) and (i) in turn illustrates that the generated training dataset matters. Overall, the SESR model trained with the generated training set performs better than the other methods, and its results look closer to the ground truth. The images predicted by BM3D+Bicubic, although they have a high PSNR on the test set, blur the detailed structure of the image, which is undesirable in real applications. Although the results of our model also look a little blurred, this pipeline illustrates a promising way to solve our problem.

5 Conclusion

This work proposed a pipeline to deal with the SEM-image super resolution problem. SEM-image super-resolution algorithms have high potential to accelerate SEM image collection and ultimately break the bottleneck of obtaining high-volume biological tissue image sequences. However, since the noise in SEM images cannot be neglected, the SEM-image super resolution problem cannot be treated in the same way as the ordinary SISR problem. The most difficult part of this application is obtaining suitable training data. This work first analyzed the difference between SEM images and ordinary images, and then designed a GAN to fit the distribution of the noise in SEM images. The generator can then be used to generate training samples for the final training process. Compared with other possible solutions, this training-sample-generation method performs better and still has much room for improvement.