Neurocomputing

Volume 505, 21 September 2022, Pages 188-202

STDIN: Spatio-temporal distilled interpolation for electron microscope images

https://doi.org/10.1016/j.neucom.2022.07.037

Abstract

Recently, flow-based approaches have shown considerable success in interpolating video frames. However, in contrast to video frames, electron microscope (EM) images are more complex owing to noise and severe deformation between consecutive sections. Consequently, conventional flow-based interpolation algorithms, which assume a single offset per position, cannot robustly model the movement of such complicated data. To address these problems, this study proposes a novel EM image interpolation framework that accommodates a range of offsets per location and further distills the intermediate features. First, a spatio-temporal ensemble (STE) interpolation module for capturing the missing middle features is presented. The STE is subdivided into two modules: temporal interpolation and a residual spatial-correlated block (RSCB). The former predicts the intermediate features in two directions with several offsets at each location, while the RSCB uses correlation coefficients for aggregated sampling. Thus, even if intermediate features are severely deformed, the STE effectively improves their accuracy. Second, a stackable feedback distillation block (SFDB) is introduced, which enhances the quality of intermediate features by distilling them from the input and interpolated images through a feedback mechanism. Extensive experiments demonstrate that the proposed method outperforms previous studies both quantitatively and qualitatively.

Introduction

When imaging with an electron microscope (EM), the temporal information contained in consecutive images allows an increase in z-axis resolution and thus an improvement in the quality of volume reconstruction (see Fig. 1). For example, images with a 4 nm z-axis resolution can be interpolated to achieve a 2 nm resolution. High-resolution serial slices containing fine z-axis motion dynamics [1] yield more finely resolved biological structures, which facilitates intelligent data analysis tasks in biology. Furthermore, significant surface imperfections can be restored through EM image interpolation. These capabilities illustrate the practical importance of EM interpolation in microscopic imaging.

Serial section imaging records the continuity of biological tissues along the z-axis, whereas video streams reflect the temporal progression of events. Compared with general video streams, biological serial slices differ fundamentally in appearance, including grayscale content, more prominent edges, and distinct content patterns. These distinctions stem from the SEM imaging system and affect the applicability of common motion estimation and compensation algorithms. Fig. 2 illustrates the distinctions between EM and natural images. First, neither grayscale nor sRGB images of natural scenes clearly reveal object edges, whereas grayscale EM images retain visible tissue edges. Second, the content patterns of natural and EM images differ dramatically: EM images are dominated by simple patterns, such as membrane structures, mitochondria, and vesicles, whereas natural images contain a more diverse range of textures and substances. Lastly, as highlighted in the colored boxes, the motion trend in natural images is traceable and observable, whereas for EM images a z-axis resolution of just 10 nm causes massive and chaotic deformations. In addition, the EM imaging system is more complex and less stable than an optical CCD, resulting in a low signal-to-noise ratio and unstable image quality. In summary, EM images exhibit more complex deformation patterns. Consequently, a single offset per pixel in a flow estimator [5], [6], [7], [8] is inadequate for describing the complicated motion in EM images.
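To make the single-offset assumption concrete, backward warping with one flow vector per pixel can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name and the bilinear sampling scheme are illustrative assumptions.

```python
import numpy as np

def warp_single_offset(img, flow):
    """Backward-warp img using one (dx, dy) offset per pixel, with
    bilinear sampling. The single offset per position is exactly the
    modeling assumption that breaks down on chaotic EM deformations.

    img:  (H, W) grayscale frame
    flow: (H, W, 2) flow field; flow[..., 0] is dx, flow[..., 1] is dy
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    # Source coordinates, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear blend of the four neighboring pixels.
    return ((1 - wy) * ((1 - wx) * img[y0, x0] + wx * img[y0, x1])
            + wy * ((1 - wx) * img[y1, x0] + wx * img[y1, x1]))
```

With a zero flow field the warp is the identity; any mismatch between the true deformation and the single predicted offset directly corrupts the warped result.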

In recent years, deep convolutional neural networks have shown promising results for video frame interpolation. Early studies [9], [10] estimate a spatially adaptive convolution kernel per pixel and employ separable strategies to reduce model capacity. However, motion construction with adaptive kernels is also one reason these methods perform poorly in complex scenarios. Deep optical flow [5], [6], [7], [8] computes the motion relationships between frames with high accuracy, and subsequent studies [11], [12], [13], [14], [3] interpolate video frames using deep optical flow estimation, producing visually convincing results. However, deep optical flow predicts a single offset for each location and warps the pixel at that point according to the predicted offset. Owing to this limited modeling capability, flow-based interpolation algorithms perform poorly on complicated EM images, producing severe smoothing and artifacts. For motion estimation, deformable convolution [15], [16] provides a novel solution by calculating multiple offsets for each position, and numerous studies [17], [18] employ deformable convolution rather than deep flow estimation to achieve implicit temporal alignment. Recently, [4] integrated deformable convolution and pyramid features [19], [20] into frame interpolation and proposed a module for interpolating pyramid temporal features. However, this feature temporal interpolation module only captures the temporal context without incorporating joint spatial correlation, leading to degraded intermediate features. Furthermore, the lack of spatial rectification accentuates the temporal mismatch when large motions occur. The ConvLSTM [21], used for large movements, incurs a large number of model parameters as well as slow runtime.
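The contrast with single-offset warping can be illustrated with a toy multi-offset aggregation in the spirit of deformable convolution: each position gathers features at K offsets and blends them with per-offset weights. This is a simplified nearest-neighbor sketch under assumed shapes; real deformable convolution [15], [16] uses bilinear sampling and learned modulation, and the names here are illustrative.

```python
import numpy as np

def multi_offset_sample(feat, offsets, weights):
    """Aggregate K sampled values per position -- the deformable-
    convolution idea in miniature (nearest-neighbor for brevity).

    feat:    (H, W) feature map
    offsets: (K, H, W, 2) K learned (dx, dy) offsets per position
    weights: (K, H, W) per-offset aggregation weights
    """
    H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((H, W), dtype=np.float32)
    for k in range(len(offsets)):
        # Round and clamp the sampling coordinates for offset k.
        sx = np.clip(np.round(xs + offsets[k, ..., 0]), 0, W - 1).astype(int)
        sy = np.clip(np.round(ys + offsets[k, ..., 1]), 0, H - 1).astype(int)
        out += weights[k] * feat[sy, sx]
    return out
```

With K > 1, a position can draw evidence from several candidate displacements at once instead of committing to a single flow vector.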

To address these shortcomings, this study proposes an efficient spatio-temporal distilled interpolation network (STDIN) for EM images. The feature spatio-temporal ensemble (STE) module handles the dynamic background and accurately predicts the interpolated features. Specifically, the STE captures pyramid temporal features and calculates spatial correlation coefficients; the final interpolated features are then synthesized using region sampling. Although the intermediate features with dual-domain embedding show promising performance in motion processing, the interpolated features still exhibit multiple mismatches under severe anisotropy and large motions. Moreover, when interpolated features are used as a reference to retrieve relevant input features, incorrect predictions of intermediate features are exacerbated by background noise and fluctuations in image quality. Therefore, this paper proposes a lightweight, stackable feedback distillation block (SFDB) to purify intermediate features and minimize the temporal mismatch caused by large deformations. The SFDB adapts the feedback distillation in response to the input features. Moreover, this study finds that the feedback distillation correction is stackable: the resulting intermediate features become more precise as the number of stacked modules increases.

The contributions of this paper are summarized as follows:

  • 1.

This work extends video frame interpolation to electron microscopy and proposes a simple but effective framework for interpolating EM images. The approach incorporates spatio-temporal ensemble sampling and feedback distillation, and the interpolated frames it generates from EM images are more precise than those generated by previous interpolation algorithms.

  • 2.

A spatio-temporal ensemble module that combines temporal context and spatially correlated information is presented. Moreover, a novel feedback distillation module is introduced, which enables acquiring the best-aligned intermediate features under the supervision of the input images.

  • 3.

Extensive experiments demonstrate that this approach achieves state-of-the-art performance on EM benchmark datasets and outperforms recent state-of-the-art frame interpolation algorithms.


Video Frame Interpolation

Video frame interpolation (VFI) is a technique that uses input frames to predict non-existent intermediate frames. [22] first introduced general convolutional neural networks (CNNs) [23], [24] into video frame interpolation. However, severe artifacts and blur are unavoidable when CNNs directly synthesize interpolated frames. In this context, [25] proposed deep voxel flow to warp the input frames based on trilinear sampling, which produces little blur but performs insufficiently in scenes with

Proposed Method

Given two input EM frames I0 and I2 that are consecutive along the z-axis, our goal is to synthesize the corresponding intermediate frame Î1. To accurately extract the deformation field from the complex EM images and handle the unstable image quality, we propose a novel spatio-temporal distilled interpolation framework that progressively aggregates temporal content and spatially related information. We first encode the input feature maps F0 and F2 using the feature extractor with a
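The RSCB's "correlation coefficients for aggregated sampling" can be sketched in miniature: correlation scores between a reference feature vector and candidate features from a neighborhood are normalized and used as aggregation weights. This NumPy sketch is an illustrative assumption, not the paper's exact formulation; in particular, the softmax normalization and function name are invented for the example.

```python
import numpy as np

def correlation_aggregate(ref, candidates):
    """Aggregate neighborhood candidates weighted by their correlation
    with a reference feature (softmax-normalized dot products).

    ref:        (C,) reference feature at one position
    candidates: (K, C) features sampled from a local neighborhood
    returns:    (C,) correlation-weighted aggregate
    """
    scores = candidates @ ref            # correlation coefficients
    w = np.exp(scores - scores.max())    # numerically stable softmax
    w /= w.sum()
    return w @ candidates                # weighted aggregation
```

Candidates that correlate strongly with the reference dominate the aggregate, so even a severely deformed intermediate feature is pulled toward spatially consistent textures.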

Implementation Details

For training, a triplet of EM image patches of size 256×256 is randomly cropped; the two odd-indexed frames are used as inputs, while the middle frame serves as supervision. For data augmentation, the patches are randomly rotated by 90°, 180°, and 270°, flipped horizontally, and arbitrarily reversed in temporal order. The Pyramid, Cascading and Deformable (PCD) architecture in [17] is used to perform temporal deformable alignment
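The augmentation recipe above can be sketched as follows. This is a minimal sketch; the function name, 50% probabilities, and rng parameter are illustrative assumptions rather than the paper's exact training code.

```python
import random
import numpy as np

def augment(triplet, rng=random):
    """Augment a triplet [I0, I1, I2] of square patches: random rotation
    by 0/90/180/270 degrees, random horizontal flip, and random reversal
    of the temporal order (I1, the supervision frame, stays in the middle).
    """
    k = rng.choice([0, 1, 2, 3])                  # quarter-turn count
    triplet = [np.rot90(p, k) for p in triplet]
    if rng.random() < 0.5:                        # horizontal flip
        triplet = [np.fliplr(p) for p in triplet]
    if rng.random() < 0.5:                        # reverse temporal order
        triplet = triplet[::-1]
    return [np.ascontiguousarray(p) for p in triplet]
```

Because rotation and flipping are applied identically to all three frames, the spatial correspondence needed for supervision is preserved, and reversing the order exploits the symmetry of interpolation in z.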

Evaluation from a Biology Perspective

As stated in Section 1, the objective of EM slice interpolation is to increase z-axis resolution, decrease anisotropy, and hence improve volume reconstruction. With the rapid evolution of biomedical segmentation, large-scale volume reconstruction [44] can now rebuild biological tissues of interest, such as membranes, mitochondria, and synapses. Consequently, the proposed method is evaluated from a biological perspective through biomedical segmentation. More specifically, the membrane

Conclusion

This study develops a framework for interpolating EM images with complex deformations and unstable quality. The framework comprises two primary modules: one for spatio-temporal fusion and the other for feedback distillation. The spatio-temporal ensemble module estimates spatial correlation coefficients and samples similar textures based on temporal features to maintain edge continuity. Owing to the inherent mismatches of temporal features, a stackable feedback distillation module is proposed for

CRediT authorship contribution statement

Zejin Wang: Methodology, Visualization, Formal analysis, Writing - original draft. Guodong Sun: Data curation, Writing - review & editing. Guoqing Li: Data curation, Writing - review & editing, Supervision. Lijun Shen: Writing - review & editing, Supervision. Lina Zhang: Visualization, Data curation. Hua Han: Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Science and Technology Innovation 2030 Major Program (2021ZD0204503, 2021ZD0204500), the Strategic Priority Research Program of Chinese Academy of Science (No. XDB32030208 to H.H.), International Partnership Program of Chinese Academy of Science (No. 153D31KYSB20170059 to H.H.), Program of Beijing Municipal Science & Technology Commission (No. Z201100008420004 to H.H.), National Natural Science Foundation of China (No. 32171461 to H.H.), and the Strategic

Zejin Wang received the B.S. degree in School of Electrical and Mechanical Engineering from Wuhan University of Technology, Wuhan, in 2018. He is currently pursuing his Ph.D. degree in the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image restoration, video interpolation and self-supervised learning.

References (47)

  • W. Li et al., Video interpolation using optical flow and Laplacian smoothness, Neurocomputing (2017)
  • C.S. Xu et al., Enhanced FIB-SEM systems for large-volume 3D imaging, eLife (2017)
  • T. Xue et al., Video enhancement with task-oriented flow, International Journal of Computer Vision (2019)
  • W. Bao et al., Depth-aware video frame interpolation
  • X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J.P. Allebach, C. Xu, Zooming Slow-Mo: Fast and accurate one-stage space-time video...
  • A. Dosovitskiy et al., FlowNet: Learning optical flow with convolutional networks
  • E. Ilg et al., FlowNet 2.0: Evolution of optical flow estimation with deep networks
  • A. Ranjan et al., Optical flow estimation using a spatial pyramid network
  • D. Sun, X. Yang, M.-Y. Liu, J. Kautz, PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume, in:...
  • S. Niklaus et al., Video frame interpolation via adaptive convolution
  • S. Niklaus et al., Video frame interpolation via adaptive separable convolution
  • S. Niklaus et al., Context-aware synthesis for video frame interpolation
  • H. Jiang et al., Super SloMo: High quality estimation of multiple intermediate frames for video interpolation
  • W. Bao et al., MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
  • J. Dai et al., Deformable convolutional networks
  • X. Zhu et al., Deformable ConvNets v2: More deformable, better results
  • X. Wang, K.C. Chan, K. Yu, C. Dong, C. Change Loy, EDVR: Video restoration with enhanced deformable convolutional...
  • Y. Tian, Y. Zhang, Y. Fu, C. Xu, TDAN: Temporally-deformable alignment network for video super-resolution, in:...
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, ICLR (2015)
  • T.-Y. Lin et al., Feature pyramid networks for object detection
  • S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-C. Woo, Convolutional LSTM network: A machine learning...
  • G. Long et al., Learning image matching by simply watching video
  • K. He et al., Deep residual learning for image recognition


    Guodong Sun received the B.S. degree in automation from North China Electric Power University, Beijing, China, in 2019. Now he is pursuing his master’s degree in the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include instance segmentation and active learning.

    Guoqing Li received the B.S. degree from Beijing Jiaotong University, Beijing, China, in 2007 and the Ph.D. degree in National Space Science Center, Chinese Academy of Sciences, Beijing, China, in 2012. He is the image algorithms senior engineer in Hermes-Microvision Ltd. in 2013–2015. Now he is an Assistant Research Fellow with Institute of Automation, Chinese Academy of Sciences. His research interests include electron microscope imaging system, image and video processing and structure analysis and modeling of brain.

    Lijun Shen received the B.S. degree from Northwestern University, Xi’an, China, in 2004, the M.S. degree from Inner Mongolia University of Technology, Hohhot, China, in 2010 and the Ph.D. degree from Macau University of Science and Technology, Macau, China, in 2021. He is currently an Assistant Research Fellow with Institute of Automation, Chinese Academy of Sciences. His research interests include massive data management and distributed computing.

    Lina Zhang received her M.S. degree from China University of Geosciences (Beijing), major in materials engineering, in 2017. She is currently a microanalysis engineer, engaged in the acquisition of high-throughput microscope images in the field of brain science and material science.

    Hua Han received the B.S. degree from Xi’an Jiaotong University, Xi’an, China, in 1996, the M.S. degree from Chinese Ship Research and Development Academy, Beijing, China, in 1999 and the Ph.D. degree from Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2004. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences, a member of CAS Center for Excellence in Brain Science and Intelligence Technology, and a Professor with the Future technological college, University of Chinese Academy of Sciences. His research interests include image processing, computational neuroscience and pattern recognition.
